Refactoring the checkpointer's fsync request queue
Hello hackers,
Currently, md.c and checkpointer.c interact in a way that breaks
smgr.c's modularity. That doesn't matter much if md.c is the only
storage manager implementation, but there are now two proposals
to provide new kinds of block storage accessed via the buffer manager:
UNDO and SLRU.
Here is a patch that rips the fsync stuff out of md.c, generalises it
and puts it into a new translation unit smgrsync.c. It can deal with
fsync()ing any files you want at checkpoint time, as long as they can
be described by a SmgrFileTag (a struct type we can extend as needed).
A pathname would work too, but I wanted something small and fixed in
size. It's just a tag that can be converted to a path in case it
needs to be reopened (eg on Windows), but otherwise is used as a hash
table key to merge requests.
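For md.c the tag works out to something like this (the patch only
shows the fields being assigned -- node, forknum, segno -- so take
this as a sketch of the shape rather than the exact definition in
smgrsync.h):

    typedef struct SmgrFileTag
    {
        RelFileNode node;    /* which relation */
        ForkNumber  forknum; /* which fork (InvalidForkNumber = all forks) */
        BlockNumber segno;   /* which segment file */
    } SmgrFileTag;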
There is one major fly in the ointment: fsyncgate[1]. Originally I
planned to propose a patch on top of that one, but it's difficult --
both patches move a lot of the same stuff around. Personally, I don't
think it would be a very good idea to back-patch that anyway. It'd be
riskier than the problem it aims to solve, in terms of bugs and
hard-to-foresee portability problems IMHO. I think we should consider
back-patching some variant of Craig Ringer's PANIC patch, and consider
this redesigned approach for future releases.
So, please find attached the WIP patch that I would like to propose
for PostgreSQL 12, under a separate Commitfest entry. It incorporates
the fsyncgate work by Andres Freund (original file descriptor transfer
POC) and me (many bug fixes and improvements), and the refactoring
work as described above.
It can be compiled in two modes: with the macro
CHECKPOINTER_TRANSFER_FILES defined, it sends fds to the checkpointer,
but if you comment out that macro definition for testing, or build on
Windows, it reverts to a mode that reopens files in the checkpointer.
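Concretely, the mode switch is just this fragment in
ForwardFsyncRequest() (quoted from the patch); when contains_fd is
false, the checkpointer converts the tag back to a path and reopens
the file itself:

    #ifdef CHECKPOINTER_TRANSFER_FILES
        request.contains_fd = file != -1;
    #else
        request.contains_fd = false;
    #endif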
I'm hoping to find a Windows-savvy collaborator to help finish the
Windows support. Right now it passes make check on AppVeyor, but it
needs to be reviewed and tested on a real system with a small
shared_buffers (installcheck, pgbench, other attempts to break it).
Other than that, there are a couple of remaining XXX notes for small
known details, but I wanted to post this version now.
[1]: /messages/by-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Refactor-the-checkpointer-request-queue.patch (application/octet-stream)
From 64e8bffde370c1b4fcef17a5a372080cda012660 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 15 Oct 2018 22:48:05 +1300
Subject: [PATCH] Refactor the checkpointer request queue.
1. Decouple the checkpoint queue machinery from md.c, so that
future SMGR implementations can also use it.
2. Keep file descriptors open to avoid losing errors on some OSes.
Craig Ringer discovered that our practice of closing files and then
reopening them in the checkpointer so it can call fsync(2) could
lose track of write-back errors on Linux.
Change to a model where file descriptors are sent to the checkpointer
via the ancillary data mechanism of Unix domain sockets, and the
oldest file descriptor for each given file is kept open, so that
the write-back errors cannot be lost.
On Windows, a pipe is the most natural replacement for a Unix domain
socket, but unfortunately pipes don't support multiplexing via
WSAEventSelect(), as used by our WaitEventSet machinery. So use
"overlapped" IO, and add the ability to wait for IO completion to
WaitEventSet. A new wait event flag WL_WIN32_HANDLE is provided
on Windows only, and used to wait for asynchronous read and write
operations over the checkpointer pipe. For now file descriptors are
not transferred via the pipe on Windows.
3. Consider fsync failures to be fatal; the status of data written
before a failed fsync(2) is unknown on Linux, even after a later
successful fsync(2) call, so it is unsafe to complete a checkpoint.
Author: Andres Freund and Thomas Munro
Reviewed-by: Thomas Munro, Dmitry Dolgov
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
Discussion: https://postgr.es/m/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
---
src/backend/access/transam/xlog.c | 9 +-
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/commands/dbcommands.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 544 +++++++++------
src/backend/postmaster/postmaster.c | 123 +++-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/file/fd.c | 217 +++++-
src/backend/storage/freespace/freespace.c | 5 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/latch.c | 12 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 791 ++-------------------
src/backend/storage/smgr/smgr.c | 63 +-
src/backend/storage/smgr/smgrsync.c | 803 ++++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 71 ++
src/include/postmaster/postmaster.h | 9 +
src/include/storage/fd.h | 11 +
src/include/storage/latch.h | 1 +
src/include/storage/smgr.h | 24 +-
src/include/storage/smgrsync.h | 37 +
25 files changed, 1711 insertions(+), 1048 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7375a78ffcf..62c5f7e9b96 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/backend_random.h"
#include "utils/builtins.h"
@@ -8777,8 +8779,10 @@ CreateCheckPoint(int flags)
* Note: because it is possible for log_checkpoints to change while a
* checkpoint proceeds, we always accumulate stats, even if
* log_checkpoints is currently off.
+ *
+ * Note #2: this is reset at the end of the checkpoint, not here, because
+ * we might have to fsync before getting here (see smgrsync()).
*/
- MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
/*
@@ -9141,6 +9145,9 @@ CreateCheckPoint(int flags)
CheckpointStats.ckpt_segs_recycled);
LWLockRelease(CheckpointLock);
+
+ /* reset stats */
+ MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
}
/*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 578af2e66d8..43bc24953a4 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5342f217c02..4d56db8d7b8 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f7e9160a4f6..3096a2c904d 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index b1e9bb2c537..d373449e3f7 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c53..29d3f937292 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,7 +46,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -101,19 +104,21 @@
*
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
+ uint32 type;
+ SmgrFileTag tag;
+ bool contains_fd;
+ int ckpt_started;
+ uint64 open_seq;
/* might add a real request-type field later; not needed yet */
} CheckpointerRequest;
+#define CKPT_REQUEST_RNODE 1
+#define CKPT_REQUEST_SYN 2
+
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -126,12 +131,9 @@ typedef struct
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
- int num_requests; /* current # of requests */
- int max_requests; /* allocated array size */
- CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
+ pg_atomic_uint32 num_backend_writes; /* counts user backend buffer writes */
+ pg_atomic_uint32 num_backend_fsync; /* counts user backend fsync calls */
+ pg_atomic_uint64 ckpt_cycle; /* checkpoint sync cycle counter */
} CheckpointerShmemStruct;
static CheckpointerShmemStruct *CheckpointerShmem;
@@ -171,8 +173,9 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
static bool ImmediateCheckpointRequested(void);
-static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
+static void SendFsyncRequest(CheckpointerRequest *request, int fd);
+static bool AbsorbFsyncRequest(bool stop_at_current_cycle);
/* Signal handlers */
@@ -182,6 +185,11 @@ static void ReqCheckpointHandler(SIGNAL_ARGS);
static void chkpt_sigusr1_handler(SIGNAL_ARGS);
static void ReqShutdownHandler(SIGNAL_ARGS);
+#ifdef WIN32
+/* State used to track in-progress asynchronous fsync pipe reads. */
+static OVERLAPPED absorb_overlapped;
+static HANDLE *absorb_read_in_progress;
+#endif
/*
* Main entry point for checkpointer process
@@ -194,6 +202,7 @@ CheckpointerMain(void)
{
sigjmp_buf local_sigjmp_buf;
MemoryContext checkpointer_context;
+ WaitEventSet *wes;
CheckpointerShmem->checkpointer_pid = MyProcPid;
@@ -334,6 +343,21 @@ CheckpointerMain(void)
*/
ProcGlobal->checkpointerLatch = &MyProc->procLatch;
+ /* Create reusable WaitEventSet. */
+ wes = CreateWaitEventSet(TopMemoryContext, 3);
+ AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+ NULL);
+ AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+#ifndef WIN32
+ AddWaitEventToSet(wes, WL_SOCKET_READABLE, fsync_fds[FSYNC_FD_PROCESS],
+ NULL, NULL);
+#else
+ absorb_overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE,
+ "fsync pipe read completion");
+ AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+ &absorb_overlapped.hEvent);
+#endif
+
/*
* Loop forever
*/
@@ -345,6 +369,7 @@ CheckpointerMain(void)
int elapsed_secs;
int cur_timeout;
int rc;
+ WaitEvent event;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -545,16 +570,14 @@ CheckpointerMain(void)
cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
}
- rc = WaitLatch(MyLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- cur_timeout * 1000L /* convert to ms */ ,
- WAIT_EVENT_CHECKPOINTER_MAIN);
+ rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+ Assert(rc > 0);
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*/
- if (rc & WL_POSTMASTER_DEATH)
+ if (event.events == WL_POSTMASTER_DEATH)
exit(1);
}
}
@@ -890,16 +913,7 @@ ReqShutdownHandler(SIGNAL_ARGS)
Size
CheckpointerShmemSize(void)
{
- Size size;
-
- /*
- * Currently, the size of the requests[] array is arbitrarily set equal to
- * NBuffers. This may prove too large or small ...
- */
- size = offsetof(CheckpointerShmemStruct, requests);
- size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
-
- return size;
+ return sizeof(CheckpointerShmemStruct);
}
/*
@@ -920,13 +934,13 @@ CheckpointerShmemInit(void)
if (!found)
{
/*
- * First time through, so initialize. Note that we zero the whole
- * requests array; this is so that CompactCheckpointerRequestQueue can
- * assume that any pad bytes in the request structs are zeroes.
+ * First time through, so initialize.
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ pg_atomic_init_u64(&CheckpointerShmem->ckpt_cycle, 0);
+ pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
+ pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
}
}
@@ -1102,181 +1116,84 @@ RequestCheckpoint(int flags)
* is theoretically possible a backend fsync might still be necessary, if
* the queue is full and contains no duplicate entries. In that case, we
* let the backend know by returning false.
+ *
+ * We add the cycle counter to the message. That is an unsynchronized read
+ * of the shared memory counter, but it doesn't matter if it is arbitrarily
+ * old since it is only used to limit unnecessary extra queue draining in
+ * AbsorbAllFsyncRequests().
*/
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(const SmgrFileTag *tag, File file)
{
- CheckpointerRequest *request;
- bool too_full;
+ CheckpointerRequest request = {0};
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
+ request.type = CKPT_REQUEST_RNODE;
+ request.tag = *tag;
+#ifdef CHECKPOINTER_TRANSFER_FILES
+ request.contains_fd = file != -1;
+#else
+ request.contains_fd = false;
+#endif
/*
- * If the checkpointer isn't running or the request queue is full, the
- * backend will have to perform its own fsync request. But before forcing
- * that to happen, we can try to compact the request queue.
+ * Tell the checkpointer the sequence number of the most recent open, so
+ * that it can be sure to hold the older file descriptor.
*/
- if (CheckpointerShmem->checkpointer_pid == 0 ||
- (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
- !CompactCheckpointerRequestQueue()))
- {
- /*
- * Count the subset of writes where backends have to do their own
- * fsync
- */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
- return false;
- }
+ request.open_seq = request.contains_fd ? FileGetOpenSeq(file) : (uint64) -1;
- /* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
-
- /* If queue is more than half full, nudge the checkpointer to empty it */
- too_full = (CheckpointerShmem->num_requests >=
- CheckpointerShmem->max_requests / 2);
-
- LWLockRelease(CheckpointerCommLock);
-
- /* ... but not till after we release the lock */
- if (too_full && ProcGlobal->checkpointerLatch)
- SetLatch(ProcGlobal->checkpointerLatch);
+ /*
+ * We read ckpt_started without synchronization. It is used to prevent
+ * AbsorbAllFsyncRequests() from reading new values from after a
+ * checkpoint began. A slightly out-of-date value here will only cause
+ * it to do a little bit more work than strictly necessary, but that's
+ * OK.
+ */
+ request.ckpt_started = CheckpointerShmem->ckpt_started;
- return true;
+ SendFsyncRequest(&request,
+ request.contains_fd ? FileGetRawDesc(file) : -1);
}
/*
- * CompactCheckpointerRequestQueue
- * Remove duplicates from the request queue to avoid backend fsyncs.
- * Returns "true" if any entries were removed.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen. So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint. Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
+ * AbsorbFsyncRequests
+ * Retrieve queued fsync requests and pass them to local smgr. Stop when
+ * resources would be exhausted by absorbing more.
*
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries. But that should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
+ * This is exported because we want to continue accepting requests during
+ * smgrsync().
*/
-static bool
-CompactCheckpointerRequestQueue(void)
+void
+AbsorbFsyncRequests(void)
{
- struct CheckpointerSlotMapping
- {
- CheckpointerRequest request;
- int slot;
- };
-
- int n,
- preserve_count;
- int num_skipped = 0;
- HASHCTL ctl;
- HTAB *htab;
- bool *skip_slot;
-
- /* must hold CheckpointerCommLock in exclusive mode */
- Assert(LWLockHeldByMe(CheckpointerCommLock));
-
- /* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
-
- /* Initialize temporary hash table */
- MemSet(&ctl, 0, sizeof(ctl));
- ctl.keysize = sizeof(CheckpointerRequest);
- ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
- ctl.hcxt = CurrentMemoryContext;
-
- htab = hash_create("CompactCheckpointerRequestQueue",
- CheckpointerShmem->num_requests,
- &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /*
- * The basic idea here is that a request can be skipped if it's followed
- * by a later, identical request. It might seem more sensible to work
- * backwards from the end of the queue and check whether a request is
- * *preceded* by an earlier, identical request, in the hopes of doing less
- * copying. But that might change the semantics, if there's an
- * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
- * we do it this way. It would be possible to be even smarter if we made
- * the code below understand the specific semantics of such requests (it
- * could blow away preceding entries that would end up being canceled
- * anyhow), but it's not clear that the extra complexity would buy us
- * anything.
- */
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- CheckpointerRequest *request;
- struct CheckpointerSlotMapping *slotmap;
- bool found;
-
- /*
- * We use the request struct directly as a hashtable key. This
- * assumes that any padding bytes in the structs are consistently the
- * same, which should be okay because we zeroed them in
- * CheckpointerShmemInit. Note also that RelFileNode had better
- * contain no pad bytes.
- */
- request = &CheckpointerShmem->requests[n];
- slotmap = hash_search(htab, request, HASH_ENTER, &found);
- if (found)
- {
- /* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
- num_skipped++;
- }
- /* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
- }
+ if (!AmCheckpointerProcess())
+ return;
- /* Done with the hash table. */
- hash_destroy(htab);
+ /* Transfer stats counts into pending pgstats message */
+ BgWriterStats.m_buf_written_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+ BgWriterStats.m_buf_fsync_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
- /* If no duplicates, we're out of luck. */
- if (!num_skipped)
+ while (true)
{
- pfree(skip_slot);
- return false;
- }
+ if (!FlushFsyncRequestQueueIfNecessary())
+ break;
- /* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ if (!AbsorbFsyncRequest(false))
+ break;
}
- ereport(DEBUG1,
- (errmsg("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
-
- /* Cleanup. */
- pfree(skip_slot);
- return true;
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbAllFsyncRequests
+ * Retrieve all already pending fsync requests and pass them to local
+ * smgr.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1284,54 +1201,121 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbAllFsyncRequests(void)
{
- CheckpointerRequest *requests = NULL;
- CheckpointerRequest *request;
- int n;
-
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
/* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
+ BgWriterStats.m_buf_written_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+ BgWriterStats.m_buf_fsync_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ for (;;)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+ if (!FlushFsyncRequestQueueIfNecessary())
+ elog(FATAL, "may not happen");
+
+ if (!AbsorbFsyncRequest(true))
+ break;
}
+}
+
+/*
+ * AbsorbFsyncRequest
+ * Retrieve one queued fsync request and pass it to local smgr.
+ */
+static bool
+AbsorbFsyncRequest(bool stop_at_current_cycle)
+{
+ static CheckpointerRequest req;
+ int fd = -1;
+#ifndef WIN32
+ int ret;
+#else
+ DWORD bytes_read;
+#endif
+
+ ReleaseLruFiles();
START_CRIT_SECTION();
+#ifndef WIN32
+ ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS],
+ &req,
+ sizeof(req),
+ &fd);
+ if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
+ {
+ END_CRIT_SECTION();
+ return false;
+ }
+ else if (ret < 0)
+ elog(ERROR, "recvmsg failed: %m");
+#else
+ if (!absorb_read_in_progress)
+ {
+ if (!ReadFile(fsyncPipe[FSYNC_FD_PROCESS],
+ &req,
+ sizeof(req),
+ &bytes_read,
+ &absorb_overlapped))
+ {
+ if (GetLastError() != ERROR_IO_PENDING)
+ {
+ _dosmaperr(GetLastError());
+ elog(ERROR, "can't begin read from fsync pipe: %m");
+ }
- CheckpointerShmem->num_requests = 0;
+ /*
+ * An asynchronous read has begun. We'll tell caller to call us
+ * back when the event indicates completion.
+ */
+ absorb_read_in_progress = &absorb_overlapped.hEvent;
+ END_CRIT_SECTION();
+ return false;
+ }
+ /* The read completed synchronously. 'req' is now populated. */
+ }
+ if (absorb_read_in_progress)
+ {
+ /* Completed yet? */
+ if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_PROCESS],
+ &absorb_overlapped,
+ &bytes_read,
+ false))
+ {
+ if (GetLastError() == ERROR_IO_INCOMPLETE)
+ {
+ /* Nope. Spurious event? Tell caller to wait some more. */
+ END_CRIT_SECTION();
+ return false;
+ }
+ _dosmaperr(GetLastError());
+ elog(ERROR, "can't complete from fsync pipe: %m");
+ }
+ /* The asynchronous read completed. 'req' is now populated. */
+ absorb_read_in_progress = NULL;
+ }
- LWLockRelease(CheckpointerCommLock);
+ /* Check message size. */
+ if (bytes_read != sizeof(req))
+ elog(ERROR, "unexpected short read on fsync pipe");
+#endif
- for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ if (req.contains_fd != (fd != -1))
+ {
+ elog(FATAL, "message should have fd associated, but doesn't");
+ }
+ RememberFsyncRequest(&req.tag, fd, req.open_seq);
END_CRIT_SECTION();
- if (requests)
- pfree(requests);
+ if (stop_at_current_cycle &&
+ req.ckpt_started == CheckpointerShmem->ckpt_started)
+ return false;
+
+ return true;
}
/*
@@ -1374,3 +1358,139 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+uint64
+GetCheckpointSyncCycle(void)
+{
+ return pg_atomic_read_u64(&CheckpointerShmem->ckpt_cycle);
+}
+
+uint64
+IncCheckpointSyncCycle(void)
+{
+ return pg_atomic_fetch_add_u64(&CheckpointerShmem->ckpt_cycle, 1);
+}
+
+void
+CountBackendWrite(void)
+{
+ pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
+}
+
+/*
+ * Send a message to the checkpointer's fsync socket (Unix) or pipe (Windows).
+ * This is essentially a blocking call (there is no CHECK_FOR_INTERRUPTS, and
+ * even if there were it'd be suppressed since callers hold a lock), except
+ * that we don't ignore postmaster death so we need an event loop.
+ *
+ * The code is rather different on Windows, because there we have to begin the
+ * write and then wait for it to complete, while on Unix we have to wait until
+ * we can do the write.
+ */
+static void
+SendFsyncRequest(CheckpointerRequest *request, int fd)
+{
+#ifndef WIN32
+ ssize_t ret;
+ int rc;
+
+ while (true)
+ {
+ ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT],
+ request,
+ sizeof(*request),
+ request->contains_fd ? fd : -1);
+
+ if (ret >= 0)
+ {
+ /*
+ * We don't expect short writes to ever happen in realistic
+ * implementations, but better to make sure that's true...
+ */
+ if (ret != sizeof(*request))
+ elog(FATAL, "unexpected short write to fsync request socket");
+ break;
+ }
+ else if (errno == EWOULDBLOCK || errno == EAGAIN
+#ifdef __darwin__
+ || errno == EMSGSIZE || errno == ENOBUFS
+#endif
+ )
+ {
+ /*
+ * Testing on macOS 10.13 showed occasional EMSGSIZE or
+ * ENOBUFS errors, which could be handled by retrying. Unless
+ * the problem also shows up on other systems, let's handle those
+ * only for that OS.
+ */
+
+ /* Blocked on write - wait for socket to become writeable */
+ rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
+ fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+ else
+ ereport(FATAL, (errmsg("could not send fsync request: %m")));
+ }
+
+#else /* WIN32 */
+ {
+ OVERLAPPED overlapped = {0};
+ DWORD nwritten;
+ int rc;
+
+ overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE, NULL);
+
+ if (!WriteFile(fsyncPipe[FSYNC_FD_SUBMIT],
+ request,
+ sizeof(*request),
+ &nwritten,
+ &overlapped))
+ {
+ WaitEventSet *wes;
+ WaitEvent event;
+
+ /* Handle unexpected errors. */
+ if (GetLastError() != ERROR_IO_PENDING)
+ {
+ _dosmaperr(GetLastError());
+ CloseHandle(overlapped.hEvent);
+ ereport(FATAL, (errmsg("could not send fsync request: %m")));
+ }
+
+ /* Wait for asynchronous IO to complete. */
+ wes = CreateWaitEventSet(TopMemoryContext, 3);
+ AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+ NULL);
+ AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+ &overlapped.hEvent);
+ for (;;)
+ {
+ rc = WaitEventSetWait(wes, -1, &event, 1, 0);
+ Assert(rc > 0);
+ if (event.events == WL_POSTMASTER_DEATH)
+ exit(1);
+ if (event.events == WL_WIN32_HANDLE)
+ {
+ if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_SUBMIT], &overlapped,
+ &nwritten, FALSE))
+ {
+ _dosmaperr(GetLastError());
+ CloseHandle(overlapped.hEvent);
+ ereport(FATAL, (errmsg("could not get result of sending fsync request: %m")));
+ }
+ if (nwritten > 0)
+ break;
+ }
+ }
+ FreeWaitEventSet(wes);
+ }
+
+ CloseHandle(overlapped.hEvent);
+ if (nwritten != sizeof(*request))
+ elog(FATAL, "unexpected short write to fsync request pipe");
+ }
+#endif
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 41de140ae01..8ec71d13fa7 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -70,6 +70,7 @@
#include <time.h>
#include <sys/wait.h>
#include <ctype.h>
+#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <fcntl.h>
@@ -434,6 +435,7 @@ static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
static void InitPostmasterDeathWatchHandle(void);
+static void InitFsyncFdSocketPair(void);
/*
* Archiver is allowed to start up at the current postmaster state?
@@ -523,9 +525,11 @@ typedef struct
HANDLE PostmasterHandle;
HANDLE initial_signal_pipe;
HANDLE syslogPipe[2];
+ HANDLE fsyncPipe[2];
#else
int postmaster_alive_fds[2];
int syslogPipe[2];
+ int fsync_fds[2];
#endif
char my_exec_path[MAXPGPATH];
char pkglib_path[MAXPGPATH];
@@ -568,6 +572,12 @@ int postmaster_alive_fds[2] = {-1, -1};
HANDLE PostmasterHandle;
#endif
+#ifndef WIN32
+int fsync_fds[2] = {-1, -1};
+#else
+HANDLE fsyncPipe[2] = {0, 0};
+#endif
+
/*
* Postmaster main entry point
*/
@@ -1195,6 +1205,11 @@ PostmasterMain(int argc, char *argv[])
*/
InitPostmasterDeathWatchHandle();
+ /*
+ * Initialize socket pair used to transport file descriptors over.
+ */
+ InitFsyncFdSocketPair();
+
#ifdef WIN32
/*
@@ -5994,7 +6009,8 @@ extern pg_time_t first_syslogger_file_time;
#define write_inheritable_socket(dest, src, childpid) ((*(dest) = (src)), true)
#define read_inheritable_socket(dest, src) (*(dest) = *(src))
#else
-static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child);
+static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child,
+ bool close_source);
static bool write_inheritable_socket(InheritableSocket *dest, SOCKET src,
pid_t childPid);
static void read_inheritable_socket(SOCKET *dest, InheritableSocket *src);
@@ -6058,11 +6074,20 @@ save_backend_variables(BackendParameters *param, Port *port,
param->PostmasterHandle = PostmasterHandle;
if (!write_duplicated_handle(¶m->initial_signal_pipe,
pgwin32_create_signal_listener(childPid),
- childProcess))
+ childProcess, true))
+ return false;
+ if (!write_duplicated_handle(¶m->fsyncPipe[0],
+ fsyncPipe[0],
+ childProcess, false))
+ return false;
+ if (!write_duplicated_handle(¶m->fsyncPipe[1],
+ fsyncPipe[1],
+ childProcess, false))
return false;
#else
memcpy(¶m->postmaster_alive_fds, &postmaster_alive_fds,
sizeof(postmaster_alive_fds));
+ memcpy(¶m->fsync_fds, &fsync_fds, sizeof(fsync_fds));
#endif
memcpy(¶m->syslogPipe, &syslogPipe, sizeof(syslogPipe));
@@ -6083,7 +6108,8 @@ save_backend_variables(BackendParameters *param, Port *port,
* process instance of the handle to the parameter file.
*/
static bool
-write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
+write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess,
+ bool close_source)
{
HANDLE hChild = INVALID_HANDLE_VALUE;
@@ -6093,7 +6119,8 @@ write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
&hChild,
0,
TRUE,
- DUPLICATE_CLOSE_SOURCE | DUPLICATE_SAME_ACCESS))
+ (close_source ? DUPLICATE_CLOSE_SOURCE : 0) |
+ DUPLICATE_SAME_ACCESS))
{
ereport(LOG,
(errmsg_internal("could not duplicate handle to be written to backend parameter file: error code %lu",
@@ -6289,9 +6316,12 @@ restore_backend_variables(BackendParameters *param, Port *port)
#ifdef WIN32
PostmasterHandle = param->PostmasterHandle;
pgwin32_initial_signal_pipe = param->initial_signal_pipe;
+ fsyncPipe[0] = param->fsyncPipe[0];
+ fsyncPipe[1] = param->fsyncPipe[1];
#else
memcpy(&postmaster_alive_fds, ¶m->postmaster_alive_fds,
sizeof(postmaster_alive_fds));
+ memcpy(&fsync_fds, ¶m->fsync_fds, sizeof(fsync_fds));
#endif
memcpy(&syslogPipe, ¶m->syslogPipe, sizeof(syslogPipe));
@@ -6468,3 +6498,88 @@ InitPostmasterDeathWatchHandle(void)
GetLastError())));
#endif /* WIN32 */
}
+
+/* Create socket used for requesting fsyncs by checkpointer */
+static void
+InitFsyncFdSocketPair(void)
+{
+ Assert(MyProcPid == PostmasterPid);
+
+#ifndef WIN32
+ if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create fsync sockets: %m")));
+ /*
+ * Set O_NONBLOCK on both fds.
+ */
+ if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFL, O_NONBLOCK) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+ if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFD, FD_CLOEXEC) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync process socket to close-on-exec mode: %m")));
+#endif
+
+ if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+ if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFD, FD_CLOEXEC) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync submit socket to close-on-exec mode: %m")));
+#endif
+#else
+ {
+ char pipename[MAX_PATH];
+ SECURITY_ATTRIBUTES sa;
+
+ memset(&sa, 0, sizeof(sa));
+
+ /*
+ * We'll create a named pipe, because anonymous pipes don't allow
+ * overlapped (= async) IO or message-oriented communication. We'll
+ * open both ends of it here, and then duplicate them into all child
+ * processes in save_backend_variables(). First, open the server end.
+ */
+ snprintf(pipename, sizeof(pipename), "\\\\.\\Pipe\\fsync_pipe.%08x",
+ GetCurrentProcessId());
+ fsyncPipe[FSYNC_FD_PROCESS] = CreateNamedPipeA(pipename,
+ PIPE_ACCESS_INBOUND | FILE_FLAG_OVERLAPPED,
+ PIPE_TYPE_MESSAGE | PIPE_WAIT,
+ 1,
+ 4096,
+ 4096,
+ -1,
+ &sa);
+ if (!fsyncPipe[FSYNC_FD_PROCESS])
+ {
+ _dosmaperr(GetLastError());
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create server end of fsync pipe: %m")));
+ }
+
+ /* Now open the client end. */
+ fsyncPipe[FSYNC_FD_SUBMIT] = CreateFileA(pipename,
+ GENERIC_WRITE,
+ 0,
+ &sa,
+ OPEN_EXISTING,
+ FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
+ NULL);
+ if (!fsyncPipe[FSYNC_FD_SUBMIT])
+ {
+ _dosmaperr(GetLastError());
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create client end of fsync pipe: %m")));
+ }
+ }
+#endif
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe57063..256cc5e0217 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 8dd51f17674..d5c8328b5d6 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -85,6 +85,7 @@
#include "catalog/pg_tablespace.h"
#include "common/file_perm.h"
#include "pgstat.h"
+#include "port/atomics.h"
#include "portability/mem.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -180,6 +181,7 @@ int max_safe_fds = 32; /* default if not changed */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
#define FD_TEMP_FILE_LIMIT (1 << 2) /* T = respect temp_file_limit */
+#define FD_NOT_IN_LRU (1 << 3) /* T = not in LRU */
typedef struct vfd
{
@@ -195,6 +197,7 @@ typedef struct vfd
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
int fileFlags; /* open(2) flags for (re)opening the file */
mode_t fileMode; /* mode to pass to open(2) */
+ uint64 open_seq; /* sequence number of opened file */
} Vfd;
/*
@@ -304,7 +307,6 @@ static void LruDelete(File file);
static void Insert(File file);
static int LruInsert(File file);
static bool ReleaseLruFile(void);
-static void ReleaseLruFiles(void);
static File AllocateVfd(void);
static void FreeVfd(File file);
@@ -333,6 +335,13 @@ static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
static int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
static int fsync_parent_path(const char *fname, int elevel);
+/* Shared memory state. */
+typedef struct
+{
+ pg_atomic_uint64 open_seq;
+} FdSharedData;
+
+static FdSharedData *fd_shared;
/*
* pg_fsync --- do fsync with or without writethrough
@@ -789,6 +798,20 @@ InitFileAccess(void)
on_proc_exit(AtProcExit_Files, 0);
}
+/*
+ * Initialize shared memory state. This is called after shared memory is
+ * ready.
+ */
+void
+FileShmemInit(void)
+{
+ bool found;
+
+ fd_shared = ShmemInitStruct("fd_shared", sizeof(*fd_shared), &found);
+ if (!found)
+ pg_atomic_init_u64(&fd_shared->open_seq, 0);
+}
+
/*
* count_usable_fds --- count how many FDs the system will let us open,
* and estimate how many are already open.
@@ -1113,6 +1136,8 @@ LruInsert(File file)
{
++nfile;
}
+ vfdP->open_seq =
+ pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
/*
* Seek to the right position. We need no special case for seekPos
@@ -1176,7 +1201,7 @@ ReleaseLruFile(void)
* Release kernel FDs as needed to get under the max_safe_fds limit.
* After calling this, it's OK to try to open another file.
*/
-static void
+void
ReleaseLruFiles(void)
{
while (nfile + numAllocatedDescs >= max_safe_fds)
@@ -1289,9 +1314,11 @@ FileAccess(File file)
* We now know that the file is open and that it is not the last one
* accessed, so we need to move it to the head of the Lru ring.
*/
-
- Delete(file);
- Insert(file);
+ if (!(VfdCache[file].fdstate & FD_NOT_IN_LRU))
+ {
+ Delete(file);
+ Insert(file);
+ }
}
return 0;
@@ -1410,6 +1437,58 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
+ vfdP->open_seq = pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
+
+ return file;
+}
+
+/*
+ * Open a File for a pre-existing file descriptor.
+ *
+ * Note that these files will not be closed on an LRU basis, therefore the
+ * caller is responsible for limiting the number of open file descriptors.
+ *
+ * The passed in name is purely for informational purposes.
+ */
+File
+FileOpenForFd(int fd, const char *fileName, uint64 open_seq)
+{
+ char *fnamecopy;
+ File file;
+ Vfd *vfdP;
+
+ /*
+ * We need a malloc'd copy of the file name; fail cleanly if no room.
+ */
+ fnamecopy = strdup(fileName);
+ if (fnamecopy == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ file = AllocateVfd();
+ vfdP = &VfdCache[file];
+
+ /* Close excess kernel FDs. */
+ ReleaseLruFiles();
+
+ vfdP->fd = fd;
+ ++nfile;
+
+ DO_DB(elog(LOG, "FileOpenForFd: success %d/%d (%s)",
+ file, fd, fnamecopy));
+
+ /* NB: Explicitly not inserted into LRU! */
+
+ vfdP->fileName = fnamecopy;
+ /* Saved flags are adjusted to be OK for re-opening file */
+ vfdP->fileFlags = 0;
+ vfdP->fileMode = 0;
+ vfdP->seekPos = 0;
+ vfdP->fileSize = 0;
+ vfdP->fdstate = FD_NOT_IN_LRU;
+ vfdP->resowner = NULL;
+ vfdP->open_seq = open_seq;
return file;
}
@@ -1760,7 +1839,11 @@ FileClose(File file)
vfdP->fd = VFD_CLOSED;
/* remove the file from the lru ring */
- Delete(file);
+ if (!(vfdP->fdstate & FD_NOT_IN_LRU))
+ {
+ vfdP->fdstate &= ~FD_NOT_IN_LRU;
+ Delete(file);
+ }
}
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
@@ -2232,6 +2315,10 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file))
+ return -1;
+
return VfdCache[file].fd;
}
@@ -2255,6 +2342,17 @@ FileGetRawMode(File file)
return VfdCache[file].fileMode;
}
+/*
+ * Get the opening sequence number of this file. This number is captured
+ * after the file was opened but before anything was written to the file,
+ */
+uint64
+FileGetOpenSeq(File file)
+{
+ Assert(FileIsValid(file));
+ return VfdCache[file].open_seq;
+}
+
/*
* Make room for another allocatedDescs[] array entry if needed and possible.
* Returns true if an array element is available.
@@ -3572,3 +3670,110 @@ MakePGDirectory(const char *directoryName)
{
return mkdir(directoryName, pg_dir_create_mode);
}
+
+#ifndef WIN32
+
+/*
+ * Send data over a unix domain socket, optionally (when fd != -1) including a
+ * file descriptor.
+ */
+ssize_t
+pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
+{
+ ssize_t size;
+ struct msghdr msg = {0};
+ struct iovec iov = {0};
+ /* cmsg header, union for correct alignment */
+ union
+ {
+ struct cmsghdr cmsghdr;
+ char control[CMSG_SPACE(sizeof (int))];
+ } cmsgu;
+ struct cmsghdr *cmsg;
+
+ memset(&cmsgu, 0, sizeof(cmsgu));
+ iov.iov_base = buf;
+ iov.iov_len = buflen;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+
+ if (fd >= 0)
+ {
+ msg.msg_control = cmsgu.control;
+ msg.msg_controllen = sizeof(cmsgu.control);
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg->cmsg_len = CMSG_LEN(sizeof (int));
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+
+ *((int *) CMSG_DATA(cmsg)) = fd;
+ }
+
+ size = sendmsg(sock, &msg, 0);
+
+ /* errors are returned directly */
+ return size;
+}
+
+/*
+ * Receive data from a unix domain socket. If a file is sent over the socket,
+ * store it in *fd.
+ */
+ssize_t
+pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
+{
+ ssize_t size;
+ struct msghdr msg;
+ struct iovec iov;
+ /* cmsg header, union for correct alignment */
+ union
+ {
+ struct cmsghdr cmsghdr;
+ char control[CMSG_SPACE(sizeof (int))];
+ } cmsgu;
+ struct cmsghdr *cmsg;
+
+ Assert(fd != NULL);
+
+ iov.iov_base = buf;
+ iov.iov_len = bufsize;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+ msg.msg_control = cmsgu.control;
+ msg.msg_controllen = sizeof(cmsgu.control);
+
+ size = recvmsg(sock, &msg, 0);
+
+ if (size < 0)
+ {
+ *fd = -1;
+ return size;
+ }
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+ {
+ if (cmsg->cmsg_level != SOL_SOCKET)
+ elog(FATAL, "unexpected cmsg_level");
+
+ if (cmsg->cmsg_type != SCM_RIGHTS)
+ elog(FATAL, "unexpected cmsg_type");
+
+ *fd = *((int *) CMSG_DATA(cmsg));
+
+ /* FIXME: check / handle additional cmsg structures */
+ }
+ else
+ *fd = -1;
+
+ return size;
+}
+
+#endif
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 7c4ad1c4494..2b47824aab9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -556,7 +556,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* not on extension.)
*/
if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ rel->rd_smgr->smgr_fsm_nblocks == 0)
{
if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
@@ -564,6 +564,9 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
else
rel->rd_smgr->smgr_fsm_nblocks = 0;
}
+ else if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
+ FSM_FORKNUM);
/* Handle requests beyond EOF */
if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c03..efbd25b84da 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
SyncScanShmemInit();
AsyncShmemInit();
BackendRandomShmemInit();
+ FileShmemInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f6dda9cc9ac..081d399eefc 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -878,6 +878,12 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
{
*handle = PostmasterHandle;
}
+#ifdef WIN32
+ else if (event->events == WL_WIN32_HANDLE)
+ {
+ *handle = *(HANDLE *)event->user_data;
+ }
+#endif
else
{
int flags = FD_CLOSE; /* always check for errors/EOF */
@@ -1453,6 +1459,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
returned_events++;
}
}
+ else if (cur_event->events & WL_WIN32_HANDLE)
+ {
+ occurred_events->events |= WL_WIN32_HANDLE;
+ occurred_events++;
+ returned_events++;
+ }
return returned_events;
}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..c9c4be325ed 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d077be..d6bff3b6e03 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -110,6 +97,7 @@ typedef struct _MdfdVec
{
File mdfd_vfd; /* fd number in fd.c's pool */
BlockNumber mdfd_segno; /* segment number, from 0 */
+ uint64 mdfd_dirtied_cycle;
} MdfdVec;
static MemoryContext MdCxt; /* context for all MdfdVec objects */
@@ -134,30 +122,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +151,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +174,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -334,6 +242,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
mdfd = &reln->md_seg_fds[forkNum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
+ mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
}
/*
@@ -388,7 +297,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -448,7 +357,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -555,7 +464,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
if (!skipFsync && !SmgrIsTemp(reln))
- register_dirty_segment(reln, forknum, v);
+ {
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = v->mdfd_segno;
+ v->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ v->mdfd_vfd,
+ v->mdfd_dirtied_cycle);
+ }
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
@@ -615,6 +533,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
mdfd = &reln->md_seg_fds[forknum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
+ mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
@@ -858,7 +777,16 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
if (!skipFsync && !SmgrIsTemp(reln))
- register_dirty_segment(reln, forknum, v);
+ {
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = v->mdfd_segno;
+ v->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ v->mdfd_vfd,
+ v->mdfd_dirtied_cycle);
+ }
}
/*
@@ -1048,660 +976,38 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * This is okay because we unlink each bitmapset from the hashtable
- * entry before scanning it. That means that any incoming fsync
- * requests will be processed now if they reach the table before we
- * begin to scan their fork.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- while ((segno = bms_first_member(requests)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
+ * Compute the path of the specified relation segment and write it into
+ * the caller-supplied buffer "out", which must be MAXPGPATH bytes long.
*/
void
-mdpostckpt(void)
+mdpath(const SmgrFileTag *tag, char *out)
{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
+ char *path;
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
+ path = relpathperm(tag->node, tag->forknum);
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
+ if (tag->segno > 0)
+ snprintf(out, MAXPGPATH, "%s.%u", path, tag->segno);
+ else
+ snprintf(out, MAXPGPATH, "%s", path);
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ pfree(path);
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
- /* Temp relations should never be fsync'd */
- Assert(!SmgrIsTemp(reln));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
- ereport(DEBUG1,
- (errmsg("could not forward fsync request because request queue is full")));
-
- if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- FilePathName(seg->mdfd_vfd))));
- }
-}
-
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
+ seg->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ seg->mdfd_vfd,
+ seg->mdfd_dirtied_cycle);
}
/*
@@ -1831,6 +1137,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
v = &reln->md_seg_fds[forknum][segno];
v->mdfd_vfd = fd;
v->mdfd_segno = segno;
+ v->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342ef86a..c36ba4298b7 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -59,9 +60,7 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ void (*smgr_path) (const SmgrFileTag *tag, char *out);
} f_smgr;
@@ -82,9 +81,7 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_path = mdpath
}
};
@@ -104,6 +101,15 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation. If more are added, we'll need to
+ * be able to dispatch based on a file tag.
+ */
+static inline int
+which_for_file_tag(const SmgrFileTag *tag)
+{
+ return 0;
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +124,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -751,50 +759,13 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
/*
- * smgrsync() -- Sync files to disk during checkpoint.
+ * smgrpath() -- Expand a tag to a path.
*/
void
-smgrsync(void)
+smgrpath(const SmgrFileTag *tag, char *out)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ smgrsw[which_for_file_tag(tag)].smgr_path(tag, out);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 00000000000..f4aad18054d
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,803 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and in the startup process.
+ *
+ * Originally this logic was inside md.c, but it is now made more general,
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Special values for the segno member of SmgrFileTag.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
+#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
+#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ SmgrFileTag tag;
+ File file;
+ uint64 cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ uint64 cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static uint32 open_fsync_queue_files = 0;
+static bool sync_in_progress = false;
+static uint64 ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static void syncpass(bool include_current);
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(SmgrFileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbAllFsyncRequests();
+
+ syncpass(false);
+}
+
+/*
+ * Do one pass over the fsync request hashtable and perform the necessary
+ * fsyncs. Increments the sync cycle counter.
+ *
+ * If include_current is true, perform all fsyncs (this is done if too many
+ * files are open); otherwise only perform the fsyncs belonging to the cycle
+ * that was current at call time.
+ */
+static void
+syncpass(bool include_current)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use GetCheckpointSyncCycle() to tell old entries apart
+ * from new ones: new ones will have cycle_ctr equal to
+ * IncCheckpointSyncCycle().
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ entry->cycle_ctr = GetCheckpointSyncCycle();
+ }
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ IncCheckpointSyncCycle();
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ /*
+ * If we are processing fsync requests because too many file handles are
+ * open, process entries regardless of their cycle. Otherwise we might
+ * not find anything to close, and we want to make room as quickly as
+ * possible so that more requests can be absorbed.
+ */
+ if (!include_current)
+ {
+ /* If the entry is new then don't process it this time. */
+ if (entry->cycle_ctr == GetCheckpointSyncCycle())
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+ }
+
+ /*
+ * If fsync is off then we don't have to bother opening the file at
+ * all. (We delay checking until this point so that changing fsync on
+ * the fly behaves sensibly.)
+ *
+ * XXX: Why is that an important goal? Doesn't give any interesting
+ * guarantees afaict?
+ */
+ if (enableFsync)
+ {
+ File file;
+
+ /*
+ * The fsync table could contain requests to fsync segments that
+ * have been deleted (unlinked) by the time we get to them. That
+ * used to be problematic, but now we have a filehandle to the
+ * deleted file. That means we might fsync an empty file
+ * superfluously, in a relatively tight window, which is
+ * acceptable.
+ */
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ if (entry->file == -1)
+ {
+ /*
+ * If we aren't transferring file descriptors directly to the
+ * checkpointer on this platform, we'll have to convert the
+ * tag to the path and open it (and close it again below).
+ */
+ char path[MAXPGPATH];
+
+ smgrpath(&entry->tag, path);
+ file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ if (file < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\" to fsync: %m",
+ path)));
+ }
+ else
+ {
+ /*
+ * Otherwise, we have kept the file descriptor from the oldest
+ * request for the same tag.
+ */
+ file = entry->file;
+ }
+
+ if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ FilePathName(file))));
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ FilePathName(file),
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+
+ if (entry->file == -1)
+ FileClose(file);
+ }
+
+ if (entry->file >= 0)
+ {
+ /*
+ * Close file. XXX: centralize code.
+ */
+ Assert(open_fsync_queue_files > 0);
+ open_fsync_queue_files--;
+ FileClose(entry->file);
+ entry->file = -1;
+ }
+
+ /* Remove the entry. */
+ if (hash_search(pendingFsyncTable, &entry->tag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingFsyncTable corrupted");
+
+ /*
+ * If in checkpointer, we want to absorb pending requests every so
+ * often to prevent overflow of the fsync request queue. It is
+ * unspecified whether newly-added entries will be visited by
+ * hash_seq_search, but we don't care since we don't need to
+ * process them anyway.
+ */
+ if (absorb_counter-- <= 0)
+ {
+ /*
+ * Don't absorb if too many files are open. This pass will
+ * soon close some, so check again later.
+ */
+ if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+ } /* end loop over hashtable entries */
+
+ /* Flag successful completion of syncpass */
+ sync_in_progress = false;
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ /* XXX: Centralize this condition */
+ if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * FsyncAtCheckpoint() -- Mark a relation segment as needing fsync
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request
+ * to the checkpointer process.
+ */
+uint64
+FsyncAtCheckpoint(const SmgrFileTag *tag, File file, uint64 last_cycle)
+{
+ uint64 cycle;
+
+ pg_memory_barrier();
+ cycle = GetCheckpointSyncCycle();
+
+ /*
+ * For historical reasons the checkpointer keeps track of the number of
+ * times backends perform writes themselves.
+ */
+ if (!AmBackgroundWriterProcess())
+ CountBackendWrite();
+
+ /* Don't repeatedly register the same segment as dirty. */
+ if (last_cycle == cycle)
+ return cycle;
+
+ if (pendingFsyncTable)
+ {
+ int fd;
+
+ /*
+ * Push it into local pending-ops table.
+ *
+ * Gotta duplicate the fd - we can't have fd.c close it behind our
+ * back, as that'd lead to losing error reporting guarantees on
+ * Linux. RememberFsyncRequest() will manage the lifetime.
+ */
+ ReleaseLruFiles();
+ fd = dup(FileGetRawDesc(file));
+ if (fd < 0)
+ elog(ERROR, "couldn't dup: %m");
+ RememberFsyncRequest(tag, fd, FileGetOpenSeq(file));
+ }
+ else
+ ForwardFsyncRequest(tag, file);
+
+ return cycle;
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ SmgrFileTag tag;
+
+ tag.node = rnode.node;
+ tag.forknum = MAIN_FORKNUM;
+ tag.segno = UNLINK_RELATION_REQUEST;
+
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+ ForwardFsyncRequest(&tag, -1);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * The range of possible segment numbers is way less than the range of
+ * BlockNumber, so we can reserve high values of segno for special purposes.
+ * We define three:
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers don't exceed INT_MAX.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(const SmgrFileTag *tag, int fd, uint64 open_seq)
+{
+ Assert(pendingFsyncTable);
+
+ if (tag->segno == FORGET_RELATION_FSYNC ||
+ tag->segno == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if ((tag->segno == FORGET_RELATION_FSYNC &&
+ tag->node.dbNode == entry->tag.node.dbNode &&
+ tag->node.relNode == entry->tag.node.relNode &&
+ (tag->forknum == InvalidForkNumber ||
+ tag->forknum == entry->tag.forknum)) ||
+ (tag->segno == FORGET_DATABASE_FSYNC &&
+ tag->node.dbNode == entry->tag.node.dbNode))
+ {
+ if (entry->file != -1)
+ {
+ Assert(open_fsync_queue_files > 0);
+ open_fsync_queue_files--;
+ FileClose(entry->file);
+ }
+ hash_search(pendingFsyncTable, entry, HASH_REMOVE, NULL);
+ }
+ }
+
+ /* Remove unlink requests */
+ if (tag->segno == FORGET_DATABASE_FSYNC)
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (tag->node.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (tag->segno == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(tag->forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = tag->node;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ tag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->file = -1;
+ entry->cycle_ctr = GetCheckpointSyncCycle();
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ if (fd >= 0)
+ {
+ File existing_file;
+ File new_file;
+
+ /*
+ * If we didn't have a file already, or we did have a file but it
+ * was opened later than this one, we'll keep the newly arrived
+ * one.
+ */
+ existing_file = entry->file;
+ if (existing_file == -1 ||
+ FileGetOpenSeq(existing_file) > open_seq)
+ {
+ char path[MAXPGPATH];
+
+ smgrpath(tag, path);
+
+ new_file = FileOpenForFd(fd, path, open_seq);
+ if (new_file < 0)
+ elog(ERROR, "cannot open file");
+ /* caller must have reserved entry */
+ entry->file = new_file;
+
+ if (existing_file != -1)
+ FileClose(existing_file);
+ else
+ open_fsync_queue_files++;
+ }
+ else
+ {
+ /*
+ * The file is already open. We have to keep the older fd,
+ * since errors might only be reported to it, so close the
+ * one we just got.
+ *
+ * XXX: check for errors.
+ */
+ close(fd);
+ }
+
+ FlushFsyncRequestQueueIfNecessary();
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * Flush the fsync request queue enough to make sure there's room for at least
+ * one more entry.
+ */
+bool
+FlushFsyncRequestQueueIfNecessary(void)
+{
+ if (sync_in_progress)
+ return false;
+
+ while (true)
+ {
+ if (open_fsync_queue_files >= ((max_safe_fds * 7) / 10))
+ {
+ elog(DEBUG1,
+ "flush fsync request queue due to %u open files",
+ open_fsync_queue_files);
+ syncpass(true);
+ elog(DEBUG1,
+ "flushed fsync request, now at %u open files",
+ open_fsync_queue_files);
+ }
+ else
+ break;
+ }
+
+ return true;
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ SmgrFileTag tag;
+
+ /* Create a special "forget relation" tag. */
+ tag.node = rnode;
+ tag.forknum = forknum;
+ tag.segno = FORGET_RELATION_FSYNC;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ ForwardFsyncRequest(&tag, -1);
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ SmgrFileTag tag;
+
+ /* Create a special "forget database" tag. */
+ tag.node.dbNode = dbid;
+ tag.node.spcNode = 0;
+ tag.node.relNode = 0;
+ tag.forknum = InvalidForkNumber;
+ tag.segno = FORGET_DATABASE_FSYNC;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ ForwardFsyncRequest(&tag, -1);
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index ede1621d3ea..019b48e1507 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2317e8be6be..9fdd39cb97f 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..137c748dfaf 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 00000000000..252a94f2909
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "storage/smgr.h"
+#include "storage/smgrsync.h"
+
+/*
+ * Control whether we transfer file descriptors to the checkpointer, to
+ * preserve error state on certain kernels. We don't yet have support for
+ * sending files on Windows (it's entirely possible but it's not clear whether
+ * it would actually be useful for anything on that platform). The macro is
+ * here just so that it can be commented out to test the non-fd-passing code
+ * path on Unix systems.
+ */
+#ifndef WIN32
+#define CHECKPOINTER_TRANSFER_FILES
+#endif
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+/* The type used for counting checkpoint cycles. */
+typedef uint32 CheckpointCycle;
+
+/*
+ * A tag identifying a file to be flushed by the checkpointer. This is
+ * convertible to the file's path, but it's convenient to have a small,
+ * fixed-size object to use as a hash table key.
+ */
+typedef struct DirtyFileTag
+{
+ RelFileNode node;
+ ForkNumber forknum;
+ int segno;
+} DirtyFileTag;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern CheckpointCycle register_dirty_file(const DirtyFileTag *tag,
+ File file,
+ CheckpointCycle last_cycle);
+
+extern void ForwardFsyncRequest(const SmgrFileTag *tag, File fd);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern uint64 GetCheckpointSyncCycle(void);
+extern uint64 IncCheckpointSyncCycle(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index 1877eef2391..821fd2d1ad2 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -44,6 +44,15 @@ extern int postmaster_alive_fds[2];
#define POSTMASTER_FD_OWN 1 /* kept open by postmaster only */
#endif
+#define FSYNC_FD_SUBMIT 0
+#define FSYNC_FD_PROCESS 1
+
+#ifndef WIN32
+extern int fsync_fds[2];
+#else
+extern HANDLE fsyncPipe[2];
+#endif
+
extern PGDLLIMPORT const char *progname;
extern void PostmasterMain(int argc, char *argv[]) pg_attribute_noreturn();
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8e7c9728f4b..d952acf714e 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -65,6 +65,7 @@ extern int max_safe_fds;
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern File FileOpenForFd(int fd, const char *fileName, uint64 open_seq);
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
@@ -78,6 +79,8 @@ extern char *FilePathName(File file);
extern int FileGetRawDesc(File file);
extern int FileGetRawFlags(File file);
extern mode_t FileGetRawMode(File file);
+extern uint64 FileGetOpenSeq(File file);
+extern void FileSetOpenSeq(File file, uint64 seq);
/* Operations used for sharing named temporary files */
extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
@@ -116,6 +119,7 @@ extern int MakePGDirectory(const char *directoryName);
/* Miscellaneous support routines */
extern void InitFileAccess(void);
+extern void FileShmemInit(void);
extern void set_max_safe_fds(void);
extern void closeAllVfds(void);
extern void SetTempTablespaces(Oid *tableSpaces, int numSpaces);
@@ -127,6 +131,7 @@ extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
extern bool looks_like_temp_rel_name(const char *name);
+extern void ReleaseLruFiles(void);
extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
@@ -143,4 +148,10 @@ extern void SyncDataDirectory(void);
#define PG_TEMP_FILES_DIR "pgsql_tmp"
#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+#ifndef WIN32
+/* XXX: This should probably go elsewhere */
+ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
+ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+#endif
+
#endif /* FD_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fd8735b7f5f..a74eedfe4e9 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -128,6 +128,7 @@ typedef struct Latch
#define WL_POSTMASTER_DEATH (1 << 4)
#ifdef WIN32
#define WL_SOCKET_CONNECTED (1 << 5)
+#define WL_WIN32_HANDLE (1 << 6)
#else
/* avoid having to deal with case on platforms not requiring it */
#define WL_SOCKET_CONNECTED WL_SOCKET_WRITEABLE
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc9692..dc22efbe0a8 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -77,6 +77,18 @@ typedef struct SMgrRelationData
typedef SMgrRelationData *SMgrRelation;
+/*
+ * A tag identifying a file to be flushed at the next checkpoint. This is
+ * convertible to the file's path, but it's convenient to have a small,
+ * fixed-size object to use as a hash table key.
+ */
+typedef struct SmgrFileTag
+{
+ RelFileNode node;
+ ForkNumber forknum;
+ int segno;
+} SmgrFileTag;
+
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
@@ -106,9 +118,7 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern void smgrpath(const SmgrFileTag *tag, char *out);
extern void AtEOXact_SMgr(void);
@@ -134,13 +144,9 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
+extern void mdpath(const SmgrFileTag *tag, char *out);
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
+extern bool FlushFsyncRequestQueueIfNecessary(void);
extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
extern void ForgetDatabaseFsyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 00000000000..f32bb22a7cc
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "postgres.h"
+
+#include "storage/fd.h"
+
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern uint64 FsyncAtCheckpoint(const SmgrFileTag *tag,
+ File file,
+ uint64 last_cycle);
+extern void RememberFsyncRequest(const SmgrFileTag *tag,
+ int fd,
+ uint64 open_seq);
+extern void SetForwardFsyncRequests(void);
+
+
+#endif
--
2.17.1 (Apple Git-112)
(Adding Dmitry to CC list.)
On Tue, Oct 16, 2018 at 12:02 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
There is one major fly in the ointment: fsyncgate[1]. Originally I
planned to propose a patch on top of that one, but it's difficult --
both patches move a lot of the same stuff around. Personally, I don't
think it would be a very good idea to back-patch that anyway. It'd be
riskier than the problem it aims to solve, in terms of bugs and
hard-to-foresee portability problems IMHO. I think we should consider
back-patching some variant of Craig Ringer's PANIC patch, and consider
this redesigned approach for future releases.

So, please find attached the WIP patch that I would like to propose
for PostgreSQL 12, under a separate Commitfest entry. It incorporates
the fsyncgate work by Andres Freund (original file descriptor transfer
POC) and me (many bug fixes and improvements), and the refactoring
work as described above.
Here is a rebased version of the patch, post pread()/pwrite(). I have
also rewritten the commit message to try to explain the rationale
concisely, instead of requiring the reader to consult multiple
discussions that jump between lengthy email threads to understand the
key points.
There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
way out of it on either the sending or receiving side. Here are three
ideas:
1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
3. Multiplex the checkpointer's work (sketched below): use
LWLockConditionalAcquire() when locking buffers, and if that fails, try
to drain the pipe, and then fall back to a LWLockTimedAcquire(), drain
pipe, repeat loop. I can hear you groan already; that doesn't seem
particularly elegant, and there are portability problems implementing
LWLockTimedAcquire(): semtimedop() and sem_timedwait() are not
available on all platforms (eg macOS). Maybe pthread_cond_timedwait()
could help (!).
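To make that concrete, here is a minimal sketch of the loop I have in
mind for the buffer-locking step in the checkpointer (hypothetical:
LWLockTimedAcquire() does not exist today, and the timeout is
arbitrary):

    for (;;)
    {
        /* First try to take the buffer lock without blocking. */
        if (LWLockConditionalAcquire(BufferDescriptorGetContentLock(bufHdr),
                                     LW_SHARED))
            break;

        /* Contended: drain the pipe so a blocked sender can progress. */
        AbsorbFsyncRequests();

        /* Wait a bounded time for the lock, then drain again and retry. */
        if (LWLockTimedAcquire(BufferDescriptorGetContentLock(bufHdr),
                               LW_SHARED, 100 /* ms, hypothetical */))
            break;
    }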
I'm not actually sure that idea 1 is correct, and I don't like its
behaviour under pressure either; pressure also seems more likely than
on current master, since we gave up sender-side queue compaction and it
seems quite easy to fill up the pipe's buffer. Number 2 appeals to
me the most right now, but I haven't looked into the details or tried
it yet. Number 3 is a straw man that perhaps helps illustrate the
problem but involves taking on unnecessary new problems and seems like
a non-starter. So, is there any reason the bgwriter process shouldn't
do that work on the checkpointer's behalf? Is there another way?
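For what it's worth, the shape I imagine for idea 2 on the
checkpointer side is roughly this (hypothetical:
BgWriterRequestBufferSync() and BgWriterBufferSyncDone() are invented
names for the handoff protocol):

    /* checkpointer: delegate BufferSync() and keep draining the pipe */
    BgWriterRequestBufferSync(flags);
    for (;;)
    {
        WaitEvent   event;

        if (BgWriterBufferSyncDone())
            break;

        /* Wait for pipe readability, latch set or postmaster death. */
        WaitEventSetWait(wes, -1 /* no timeout */, &event, 1, 0);
        if (event.events == WL_POSTMASTER_DEATH)
            exit(1);

        /* Drain whatever is available so no sender stays blocked. */
        AbsorbFsyncRequests();
    }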
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Refactor-the-checkpointer-s-data-sync-request-que-v2.patch (application/x-patch)
From 930539ed2965b902687797480c16b43827f5c1dd Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 15 Oct 2018 22:48:05 +1300
Subject: [PATCH] Refactor the checkpointer's data sync request queue.
1. Decouple the checkpoint queue machinery from md.c, so that
future SMGR implementations can also use it to have arbitrary
files flushed to disk as part of the next checkpoint.
2. Keep file descriptors open to avoid losing errors on some OSes.
Craig Ringer discovered that our practice of closing files and then
reopening them in the checkpointer so it can call fsync(2) could
hide write-back errors on Linux. While improvements have now been
made in Linux to fix the code that did that as a deliberate policy,
there remains a small risk that an error could be forgotten due
to inode cache pressure during the time when the file is not open.
The same risk exists in some other open source kernels that do not
keep dirty buffers around after write-back failure. So, we'd better
not do that.
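A simplified timeline of the hazard (illustrative only, not code from
this patch):

    backend:       write(fd, buf, len);    /* dirties kernel pages */
    backend:       close(fd);
    kernel:        /* write-back fails; error recorded on the inode */
    kernel:        /* inode evicted under cache pressure; error lost */
    checkpointer:  fd = open(path, O_RDWR);
    checkpointer:  fsync(fd);              /* returns 0; error never seen */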
Change to a model where file descriptors are sent to the
checkpointer via the ancillary data mechanism of Unix domain sockets.
One file descriptor for each given file is held open, to prevent
error state amnesia. This relies on the belief that open files
cannot be evicted from the kernel's inode cache.
To defend against an even less likely hazard on Linux, hold onto the
file descriptor that performed the oldest write. Assign a
monotonically increasing sequence number to all file descriptors
after they are opened and before they have been used to write.
This way, an external process such as a backup script can't consume
an error that we need to see. This applies to recent Linux kernels
with errseq_t-based error tracking and a "seen" flag. The sequence
number is effectively modelling Linux's internal write-back error
counter.
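In effect the rule is as follows (a sketch only; the real logic lives
in smgrsync.c's RememberFsyncRequest(), and the table and field names
here are illustrative):

    /* checkpointer has received (tag, fd, open_seq) from a backend */
    entry = hash_search(pendingOpsTable, tag, HASH_ENTER, &found);
    if (!found || open_seq < entry->open_seq)
    {
        /* Incoming fd is the oldest seen for this file: keep it. */
        if (found && entry->fd >= 0)
            close(entry->fd);
        entry->fd = fd;
        entry->open_seq = open_seq;
    }
    else
        close(fd);    /* we already hold an older fd for this file */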
Other operating systems with a simple error flag that is cleared by
the first observer combined with a policy of dropping dirty buffers
on write-back failure probably have the same problem with external
processes, but there doesn't seem to be anything we can do about
that.
On Windows, a pipe is the most natural replacement for a Unix domain
socket, but unfortunately pipes don't support multiplexing via
WSAEventSelect(), as used by our WaitEventSet machinery. So use
asynchronous I/O, and add the ability to wait for I/O completion to
WaitEventSet. A new wait event flag WL_WIN32_HANDLE is provided
on Windows only, and used to wait for asynchronous read and write
operations over the checkpointer pipe. For now, file descriptors are
not transferred via the pipe on Windows (but they could be in a future
patch; we don't currently have any evidence that a similar hazard does
or does not exist on Windows, nor that this technique would fix it if
it does, though holding the descriptors probably wouldn't hurt).
The fd-passing concept was originally proposed and prototyped by
Andres. Here it is extended, made portable and combined with the
refactoring in point 1 since both things needed to rewrite the same
code.
Author: Andres Freund and Thomas Munro
Reviewed-by: Thomas Munro, Dmitry Dolgov
Discussion: https://postgr.es/m/CAEepm%3D2gTANm%3De3ARnJT%3Dn0h8hf88wqmaZxk0JYkxw%2Bb21fNrw%40mail.gmail.com
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
Discussion: https://postgr.es/m/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
---
src/backend/access/transam/xlog.c | 9 +-
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/commands/dbcommands.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 544 +++++++++------
src/backend/postmaster/postmaster.c | 123 +++-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/file/fd.c | 217 +++++-
src/backend/storage/freespace/freespace.c | 5 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/latch.c | 12 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 791 ++-------------------
src/backend/storage/smgr/smgr.c | 63 +-
src/backend/storage/smgr/smgrsync.c | 803 ++++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 71 ++
src/include/postmaster/postmaster.h | 9 +
src/include/storage/fd.h | 11 +
src/include/storage/latch.h | 1 +
src/include/storage/smgr.h | 24 +-
src/include/storage/smgrsync.h | 37 +
25 files changed, 1711 insertions(+), 1048 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7eed5866d2e..8b0c88924a8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/backend_random.h"
#include "utils/builtins.h"
@@ -8760,8 +8762,10 @@ CreateCheckPoint(int flags)
* Note: because it is possible for log_checkpoints to change while a
* checkpoint proceeds, we always accumulate stats, even if
* log_checkpoints is currently off.
+ *
+ * Note #2: this is reset at the end of the checkpoint, not here, because
+ * we might have to fsync before getting here (see smgrsync()).
*/
- MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
/*
@@ -9124,6 +9128,9 @@ CreateCheckPoint(int flags)
CheckpointStats.ckpt_segs_recycled);
LWLockRelease(CheckpointLock);
+
+ /* reset stats */
+ MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
}
/*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 578af2e66d8..43bc24953a4 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5342f217c02..4d56db8d7b8 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f7e9160a4f6..3096a2c904d 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index b1e9bb2c537..d373449e3f7 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a033093c53..29d3f937292 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,7 +46,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -101,19 +104,21 @@
*
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
+ uint32 type;
+ SmgrFileTag tag;
+ bool contains_fd;
+ int ckpt_started;
+ uint64 open_seq;
/* might add a real request-type field later; not needed yet */
} CheckpointerRequest;
+#define CKPT_REQUEST_RNODE 1
+#define CKPT_REQUEST_SYN 2
+
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -126,12 +131,9 @@ typedef struct
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
- int num_requests; /* current # of requests */
- int max_requests; /* allocated array size */
- CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
+ pg_atomic_uint32 num_backend_writes; /* counts user backend buffer writes */
+ pg_atomic_uint32 num_backend_fsync; /* counts user backend fsync calls */
+ pg_atomic_uint64 ckpt_cycle; /* cycle */
} CheckpointerShmemStruct;
static CheckpointerShmemStruct *CheckpointerShmem;
@@ -171,8 +173,9 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
static bool ImmediateCheckpointRequested(void);
-static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
+static void SendFsyncRequest(CheckpointerRequest *request, int fd);
+static bool AbsorbFsyncRequest(bool stop_at_current_cycle);
/* Signal handlers */
@@ -182,6 +185,11 @@ static void ReqCheckpointHandler(SIGNAL_ARGS);
static void chkpt_sigusr1_handler(SIGNAL_ARGS);
static void ReqShutdownHandler(SIGNAL_ARGS);
+#ifdef WIN32
+/* State used to track in-progress asynchronous fsync pipe reads. */
+static OVERLAPPED absorb_overlapped;
+static HANDLE *absorb_read_in_progress;
+#endif
/*
* Main entry point for checkpointer process
@@ -194,6 +202,7 @@ CheckpointerMain(void)
{
sigjmp_buf local_sigjmp_buf;
MemoryContext checkpointer_context;
+ WaitEventSet *wes;
CheckpointerShmem->checkpointer_pid = MyProcPid;
@@ -334,6 +343,21 @@ CheckpointerMain(void)
*/
ProcGlobal->checkpointerLatch = &MyProc->procLatch;
+ /* Create reusable WaitEventSet. */
+ wes = CreateWaitEventSet(TopMemoryContext, 3);
+ AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+ NULL);
+ AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+#ifndef WIN32
+ AddWaitEventToSet(wes, WL_SOCKET_READABLE, fsync_fds[FSYNC_FD_PROCESS],
+ NULL, NULL);
+#else
+ absorb_overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE,
+ "fsync pipe read completion");
+ AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+ &absorb_overlapped.hEvent);
+#endif
+
/*
* Loop forever
*/
@@ -345,6 +369,7 @@ CheckpointerMain(void)
int elapsed_secs;
int cur_timeout;
int rc;
+ WaitEvent event;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -545,16 +570,14 @@ CheckpointerMain(void)
cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
}
- rc = WaitLatch(MyLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- cur_timeout * 1000L /* convert to ms */ ,
- WAIT_EVENT_CHECKPOINTER_MAIN);
+ rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+ Assert(rc > 0);
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*/
- if (rc & WL_POSTMASTER_DEATH)
+ if (event.events == WL_POSTMASTER_DEATH)
exit(1);
}
}
@@ -890,16 +913,7 @@ ReqShutdownHandler(SIGNAL_ARGS)
Size
CheckpointerShmemSize(void)
{
- Size size;
-
- /*
- * Currently, the size of the requests[] array is arbitrarily set equal to
- * NBuffers. This may prove too large or small ...
- */
- size = offsetof(CheckpointerShmemStruct, requests);
- size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
-
- return size;
+ return sizeof(CheckpointerShmemStruct);
}
/*
@@ -920,13 +934,13 @@ CheckpointerShmemInit(void)
if (!found)
{
/*
- * First time through, so initialize. Note that we zero the whole
- * requests array; this is so that CompactCheckpointerRequestQueue can
- * assume that any pad bytes in the request structs are zeroes.
+ * First time through, so initialize.
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ pg_atomic_init_u64(&CheckpointerShmem->ckpt_cycle, 0);
+ pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
+ pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
}
}
@@ -1102,181 +1116,84 @@ RequestCheckpoint(int flags)
* is theoretically possible a backend fsync might still be necessary, if
* the queue is full and contains no duplicate entries. In that case, we
* let the backend know by returning false.
+ *
+ * We add the cycle counter to the message. That is an unsynchronized read
+ * of the shared memory counter, but it doesn't matter if it is arbitrarily
+ * old since it is only used to limit unnecessary extra queue draining in
+ * AbsorbAllFsyncRequests().
*/
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(const SmgrFileTag *tag, File file)
{
- CheckpointerRequest *request;
- bool too_full;
+ CheckpointerRequest request = {0};
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
+ request.type = CKPT_REQUEST_RNODE;
+ request.tag = *tag;
+#ifdef CHECKPOINTER_TRANSFER_FILES
+ request.contains_fd = file != -1;
+#else
+ request.contains_fd = false;
+#endif
/*
- * If the checkpointer isn't running or the request queue is full, the
- * backend will have to perform its own fsync request. But before forcing
- * that to happen, we can try to compact the request queue.
+ * Tell the checkpointer the sequence number of the most recent open, so
+ * that it can be sure to hold the older file descriptor.
*/
- if (CheckpointerShmem->checkpointer_pid == 0 ||
- (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
- !CompactCheckpointerRequestQueue()))
- {
- /*
- * Count the subset of writes where backends have to do their own
- * fsync
- */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
- return false;
- }
+ request.open_seq = request.contains_fd ? FileGetOpenSeq(file) : (uint64) -1;
- /* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
-
- /* If queue is more than half full, nudge the checkpointer to empty it */
- too_full = (CheckpointerShmem->num_requests >=
- CheckpointerShmem->max_requests / 2);
-
- LWLockRelease(CheckpointerCommLock);
-
- /* ... but not till after we release the lock */
- if (too_full && ProcGlobal->checkpointerLatch)
- SetLatch(ProcGlobal->checkpointerLatch);
+ /*
+ * We read ckpt_started without synchronization. It is used to prevent
+ * AbsorbAllFsyncRequests() from absorbing requests submitted after a
+ * checkpoint began. A slightly out-of-date value here will only cause
+ * it to do a little bit more work than strictly necessary, but that's
+ * OK.
+ */
+ request.ckpt_started = CheckpointerShmem->ckpt_started;
- return true;
+ SendFsyncRequest(&request,
+ request.contains_fd ? FileGetRawDesc(file) : -1);
}
/*
- * CompactCheckpointerRequestQueue
- * Remove duplicates from the request queue to avoid backend fsyncs.
- * Returns "true" if any entries were removed.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen. So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint. Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
+ * AbsorbFsyncRequests
+ * Retrieve queued fsync requests and pass them to local smgr. Stop when
+ * resources would be exhausted by absorbing more.
*
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries. But that should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
+ * This is exported because we want to continue accepting requests during
+ * smgrsync().
*/
-static bool
-CompactCheckpointerRequestQueue(void)
+void
+AbsorbFsyncRequests(void)
{
- struct CheckpointerSlotMapping
- {
- CheckpointerRequest request;
- int slot;
- };
-
- int n,
- preserve_count;
- int num_skipped = 0;
- HASHCTL ctl;
- HTAB *htab;
- bool *skip_slot;
-
- /* must hold CheckpointerCommLock in exclusive mode */
- Assert(LWLockHeldByMe(CheckpointerCommLock));
-
- /* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
-
- /* Initialize temporary hash table */
- MemSet(&ctl, 0, sizeof(ctl));
- ctl.keysize = sizeof(CheckpointerRequest);
- ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
- ctl.hcxt = CurrentMemoryContext;
-
- htab = hash_create("CompactCheckpointerRequestQueue",
- CheckpointerShmem->num_requests,
- &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /*
- * The basic idea here is that a request can be skipped if it's followed
- * by a later, identical request. It might seem more sensible to work
- * backwards from the end of the queue and check whether a request is
- * *preceded* by an earlier, identical request, in the hopes of doing less
- * copying. But that might change the semantics, if there's an
- * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
- * we do it this way. It would be possible to be even smarter if we made
- * the code below understand the specific semantics of such requests (it
- * could blow away preceding entries that would end up being canceled
- * anyhow), but it's not clear that the extra complexity would buy us
- * anything.
- */
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- CheckpointerRequest *request;
- struct CheckpointerSlotMapping *slotmap;
- bool found;
-
- /*
- * We use the request struct directly as a hashtable key. This
- * assumes that any padding bytes in the structs are consistently the
- * same, which should be okay because we zeroed them in
- * CheckpointerShmemInit. Note also that RelFileNode had better
- * contain no pad bytes.
- */
- request = &CheckpointerShmem->requests[n];
- slotmap = hash_search(htab, request, HASH_ENTER, &found);
- if (found)
- {
- /* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
- num_skipped++;
- }
- /* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
- }
+ if (!AmCheckpointerProcess())
+ return;
- /* Done with the hash table. */
- hash_destroy(htab);
+ /* Transfer stats counts into pending pgstats message */
+ BgWriterStats.m_buf_written_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+ BgWriterStats.m_buf_fsync_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
- /* If no duplicates, we're out of luck. */
- if (!num_skipped)
+ while (true)
{
- pfree(skip_slot);
- return false;
- }
+ if (!FlushFsyncRequestQueueIfNecessary())
+ break;
- /* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ if (!AbsorbFsyncRequest(false))
+ break;
}
- ereport(DEBUG1,
- (errmsg("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
-
- /* Cleanup. */
- pfree(skip_slot);
- return true;
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbAllFsyncRequests
+ * Retrieve all already pending fsync requests and pass them to local
+ * smgr.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1284,54 +1201,121 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbAllFsyncRequests(void)
{
- CheckpointerRequest *requests = NULL;
- CheckpointerRequest *request;
- int n;
-
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
/* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
+ BgWriterStats.m_buf_written_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+ BgWriterStats.m_buf_fsync_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ for (;;)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+ if (!FlushFsyncRequestQueueIfNecessary())
+ elog(FATAL, "cannot happen");
+
+ if (!AbsorbFsyncRequest(true))
+ break;
}
+}
+
+/*
+ * AbsorbFsyncRequest
+ * Retrieve one queued fsync request and pass it to the local smgr.
+ */
+static bool
+AbsorbFsyncRequest(bool stop_at_current_cycle)
+{
+ static CheckpointerRequest req;
+ int fd = -1;
+#ifndef WIN32
+ int ret;
+#else
+ DWORD bytes_read;
+#endif
+
+ ReleaseLruFiles();
START_CRIT_SECTION();
+#ifndef WIN32
+ ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS],
+ &req,
+ sizeof(req),
+ &fd);
+ if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
+ {
+ END_CRIT_SECTION();
+ return false;
+ }
+ else if (ret < 0)
+ elog(ERROR, "recvmsg failed: %m");
+#else
+ if (!absorb_read_in_progress)
+ {
+ if (!ReadFile(fsyncPipe[FSYNC_FD_PROCESS],
+ &req,
+ sizeof(req),
+ &bytes_read,
+ &absorb_overlapped))
+ {
+ if (GetLastError() != ERROR_IO_PENDING)
+ {
+ _dosmaperr(GetLastError());
+ elog(ERROR, "can't begin read from fsync pipe: %m");
+ }
- CheckpointerShmem->num_requests = 0;
+ /*
+ * An asynchronous read has begun. We'll tell caller to call us
+ * back when the event indicates completion.
+ */
+ absorb_read_in_progress = &absorb_overlapped.hEvent;
+ END_CRIT_SECTION();
+ return false;
+ }
+ /* The read completed synchronously. 'req' is now populated. */
+ }
+ if (absorb_read_in_progress)
+ {
+ /* Completed yet? */
+ if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_PROCESS],
+ &absorb_overlapped,
+ &bytes_read,
+ false))
+ {
+ if (GetLastError() == ERROR_IO_INCOMPLETE)
+ {
+ /* Nope. Spurious event? Tell caller to wait some more. */
+ END_CRIT_SECTION();
+ return false;
+ }
+ _dosmaperr(GetLastError());
+ elog(ERROR, "can't complete from fsync pipe: %m");
+ }
+ /* The asynchronous read completed. 'req' is now populated. */
+ absorb_read_in_progress = NULL;
+ }
- LWLockRelease(CheckpointerCommLock);
+ /* Check message size. */
+ if (bytes_read != sizeof(req))
+ elog(ERROR, "unexpected short read on fsync pipe");
+#endif
- for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ if (req.contains_fd != (fd != -1))
+ {
+ elog(FATAL, "message should have fd associated, but doesn't");
+ }
+ RememberFsyncRequest(&req.tag, fd, req.open_seq);
END_CRIT_SECTION();
- if (requests)
- pfree(requests);
+ if (stop_at_current_cycle &&
+ req.ckpt_started == CheckpointerShmem->ckpt_started)
+ return false;
+
+ return true;
}
/*
@@ -1374,3 +1358,139 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+uint64
+GetCheckpointSyncCycle(void)
+{
+ return pg_atomic_read_u64(&CheckpointerShmem->ckpt_cycle);
+}
+
+uint64
+IncCheckpointSyncCycle(void)
+{
+ return pg_atomic_fetch_add_u64(&CheckpointerShmem->ckpt_cycle, 1);
+}
+
+void
+CountBackendWrite(void)
+{
+ pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
+}
+
+/*
+ * Send a message to the checkpointer's fsync socket (Unix) or pipe (Windows).
+ * This is essentially a blocking call (there is no CHECK_FOR_INTERRUPTS, and
+ * even if there were it'd be suppressed since callers hold a lock), except
+ * that we don't ignore postmaster death so we need an event loop.
+ *
+ * The code is rather different on Windows, because there we have to begin the
+ * write and then wait for it to complete, while on Unix we have to wait until
+ * we can do the write.
+ */
+static void
+SendFsyncRequest(CheckpointerRequest *request, int fd)
+{
+#ifndef WIN32
+ ssize_t ret;
+ int rc;
+
+ while (true)
+ {
+ ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT],
+ request,
+ sizeof(*request),
+ request->contains_fd ? fd : -1);
+
+ if (ret >= 0)
+ {
+ /*
+ * Don't think short writes will ever happen in realistic
+ * implementations, but better make sure that's true...
+ */
+ if (ret != sizeof(*request))
+ elog(FATAL, "unexpected short write to fsync request socket");
+ break;
+ }
+ else if (errno == EWOULDBLOCK || errno == EAGAIN
+#ifdef __darwin__
+ || errno == EMSGSIZE || errno == ENOBUFS
+#endif
+ )
+ {
+ /*
+ * Testing on macOS 10.13 showed occasional EMSGSIZE or
+ * ENOBUFS errors, which could be handled by retrying. Unless
+ * the problem also shows up on other systems, let's handle those
+ * only for that OS.
+ */
+
+ /* Blocked on write - wait for socket to become writeable */
+ rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
+ fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+ else
+ ereport(FATAL, (errmsg("could not send fsync request: %m")));
+ }
+
+#else /* WIN32 */
+ {
+ OVERLAPPED overlapped = {0};
+ DWORD nwritten;
+ int rc;
+
+ overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE, NULL);
+
+ if (!WriteFile(fsyncPipe[FSYNC_FD_SUBMIT],
+ request,
+ sizeof(*request),
+ &nwritten,
+ &overlapped))
+ {
+ WaitEventSet *wes;
+ WaitEvent event;
+
+ /* Handle unexpected errors. */
+ if (GetLastError() != ERROR_IO_PENDING)
+ {
+ _dosmaperr(GetLastError());
+ CloseHandle(overlapped.hEvent);
+ ereport(FATAL, (errmsg("could not send fsync request: %m")));
+ }
+
+ /* Wait for asynchronous IO to complete. */
+ wes = CreateWaitEventSet(TopMemoryContext, 3);
+ AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+ NULL);
+ AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+ &overlapped.hEvent);
+ for (;;)
+ {
+ rc = WaitEventSetWait(wes, -1, &event, 1, 0);
+ Assert(rc > 0);
+ if (event.events == WL_POSTMASTER_DEATH)
+ exit(1);
+ if (event.events == WL_WIN32_HANDLE)
+ {
+ if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_SUBMIT], &overlapped,
+ &nwritten, FALSE))
+ {
+ _dosmaperr(GetLastError());
+ CloseHandle(overlapped.hEvent);
+ ereport(FATAL, (errmsg("could not get result of sending fsync request: %m")));
+ }
+ if (nwritten > 0)
+ break;
+ }
+ }
+ FreeWaitEventSet(wes);
+ }
+
+ CloseHandle(overlapped.hEvent);
+ if (nwritten != sizeof(*request))
+ elog(FATAL, "unexpected short write to fsync request pipe");
+ }
+#endif
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 688f462e7d0..9eb8a04235c 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -70,6 +70,7 @@
#include <time.h>
#include <sys/wait.h>
#include <ctype.h>
+#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <fcntl.h>
@@ -435,6 +436,7 @@ static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
static void InitPostmasterDeathWatchHandle(void);
+static void InitFsyncFdSocketPair(void);
/*
* Archiver is allowed to start up at the current postmaster state?
@@ -524,9 +526,11 @@ typedef struct
HANDLE PostmasterHandle;
HANDLE initial_signal_pipe;
HANDLE syslogPipe[2];
+ HANDLE fsyncPipe[2];
#else
int postmaster_alive_fds[2];
int syslogPipe[2];
+ int fsync_fds[2];
#endif
char my_exec_path[MAXPGPATH];
char pkglib_path[MAXPGPATH];
@@ -569,6 +573,12 @@ int postmaster_alive_fds[2] = {-1, -1};
HANDLE PostmasterHandle;
#endif
+#ifndef WIN32
+int fsync_fds[2] = {-1, -1};
+#else
+HANDLE fsyncPipe[2] = {0, 0};
+#endif
+
/*
* Postmaster main entry point
*/
@@ -1186,6 +1196,11 @@ PostmasterMain(int argc, char *argv[])
*/
InitPostmasterDeathWatchHandle();
+ /*
+ * Initialize the socket pair used to transport file descriptors to the checkpointer.
+ */
+ InitFsyncFdSocketPair();
+
#ifdef WIN32
/*
@@ -5992,7 +6007,8 @@ extern pg_time_t first_syslogger_file_time;
#define write_inheritable_socket(dest, src, childpid) ((*(dest) = (src)), true)
#define read_inheritable_socket(dest, src) (*(dest) = *(src))
#else
-static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child);
+static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child,
+ bool close_source);
static bool write_inheritable_socket(InheritableSocket *dest, SOCKET src,
pid_t childPid);
static void read_inheritable_socket(SOCKET *dest, InheritableSocket *src);
@@ -6056,11 +6072,20 @@ save_backend_variables(BackendParameters *param, Port *port,
param->PostmasterHandle = PostmasterHandle;
if (!write_duplicated_handle(¶m->initial_signal_pipe,
pgwin32_create_signal_listener(childPid),
- childProcess))
+ childProcess, true))
+ return false;
+ if (!write_duplicated_handle(¶m->fsyncPipe[0],
+ fsyncPipe[0],
+ childProcess, false))
+ return false;
+ if (!write_duplicated_handle(¶m->fsyncPipe[1],
+ fsyncPipe[1],
+ childProcess, false))
return false;
#else
memcpy(¶m->postmaster_alive_fds, &postmaster_alive_fds,
sizeof(postmaster_alive_fds));
+ memcpy(¶m->fsync_fds, &fsync_fds, sizeof(fsync_fds));
#endif
memcpy(¶m->syslogPipe, &syslogPipe, sizeof(syslogPipe));
@@ -6081,7 +6106,8 @@ save_backend_variables(BackendParameters *param, Port *port,
* process instance of the handle to the parameter file.
*/
static bool
-write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
+write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess,
+ bool close_source)
{
HANDLE hChild = INVALID_HANDLE_VALUE;
@@ -6091,7 +6117,8 @@ write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
&hChild,
0,
TRUE,
- DUPLICATE_CLOSE_SOURCE | DUPLICATE_SAME_ACCESS))
+ (close_source ? DUPLICATE_CLOSE_SOURCE : 0) |
+ DUPLICATE_SAME_ACCESS))
{
ereport(LOG,
(errmsg_internal("could not duplicate handle to be written to backend parameter file: error code %lu",
@@ -6287,9 +6314,12 @@ restore_backend_variables(BackendParameters *param, Port *port)
#ifdef WIN32
PostmasterHandle = param->PostmasterHandle;
pgwin32_initial_signal_pipe = param->initial_signal_pipe;
+ fsyncPipe[0] = param->fsyncPipe[0];
+ fsyncPipe[1] = param->fsyncPipe[1];
#else
memcpy(&postmaster_alive_fds, ¶m->postmaster_alive_fds,
sizeof(postmaster_alive_fds));
+ memcpy(&fsync_fds, ¶m->fsync_fds, sizeof(fsync_fds));
#endif
memcpy(&syslogPipe, ¶m->syslogPipe, sizeof(syslogPipe));
@@ -6466,3 +6496,88 @@ InitPostmasterDeathWatchHandle(void)
GetLastError())));
#endif /* WIN32 */
}
+
+/* Create socket used for requesting fsyncs by checkpointer */
+static void
+InitFsyncFdSocketPair(void)
+{
+ Assert(MyProcPid == PostmasterPid);
+
+#ifndef WIN32
+ if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create fsync sockets: %m")));
+ /*
+ * Set O_NONBLOCK on both fds.
+ */
+ if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFL, O_NONBLOCK) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+ if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFD, FD_CLOEXEC) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync process socket to close-on-exec mode: %m")));
+#endif
+
+ if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+ if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFD, FD_CLOEXEC) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync submit socket to close-on-exec mode: %m")));
+#endif
+#else
+ {
+ char pipename[MAX_PATH];
+ SECURITY_ATTRIBUTES sa;
+
+ memset(&sa, 0, sizeof(sa));
+
+ /*
+ * We'll create a named pipe, because anonymous pipes don't allow
+ * overlapped (= async) IO or message-oriented communication. We'll
+ * open both ends of it here, and then duplicate them into all child
+ * processes in save_backend_variables(). First, open the server end.
+ */
+ snprintf(pipename, sizeof(pipename), "\\\\.\\Pipe\\fsync_pipe.%08x",
+ GetCurrentProcessId());
+ fsyncPipe[FSYNC_FD_PROCESS] = CreateNamedPipeA(pipename,
+ PIPE_ACCESS_INBOUND | FILE_FLAG_OVERLAPPED,
+ PIPE_TYPE_MESSAGE | PIPE_WAIT,
+ 1,
+ 4096,
+ 4096,
+ -1,
+ &sa);
+ if (!fsyncPipe[FSYNC_FD_PROCESS])
+ {
+ _dosmaperr(GetLastError());
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create server end of fsync pipe: %m")));
+ }
+
+ /* Now open the client end. */
+ fsyncPipe[FSYNC_FD_SUBMIT] = CreateFileA(pipename,
+ GENERIC_WRITE,
+ 0,
+ &sa,
+ OPEN_EXISTING,
+ FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
+ NULL);
+ if (!fsyncPipe[FSYNC_FD_SUBMIT])
+ {
+ _dosmaperr(GetLastError());
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create client end of fsync pipe: %m")));
+ }
+ }
+#endif
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe57063..256cc5e0217 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 2d75773ef02..63e7c3e3fd1 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -85,6 +85,7 @@
#include "catalog/pg_tablespace.h"
#include "common/file_perm.h"
#include "pgstat.h"
+#include "port/atomics.h"
#include "portability/mem.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -171,6 +172,7 @@ int max_safe_fds = 32; /* default if not changed */
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
#define FD_TEMP_FILE_LIMIT (1 << 2) /* T = respect temp_file_limit */
+#define FD_NOT_IN_LRU (1 << 3) /* T = not in LRU */
typedef struct vfd
{
@@ -185,6 +187,7 @@ typedef struct vfd
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
int fileFlags; /* open(2) flags for (re)opening the file */
mode_t fileMode; /* mode to pass to open(2) */
+ uint64 open_seq; /* sequence number of opened file */
} Vfd;
/*
@@ -294,7 +297,6 @@ static void LruDelete(File file);
static void Insert(File file);
static int LruInsert(File file);
static bool ReleaseLruFile(void);
-static void ReleaseLruFiles(void);
static File AllocateVfd(void);
static void FreeVfd(File file);
@@ -323,6 +325,13 @@ static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
static int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
static int fsync_parent_path(const char *fname, int elevel);
+/* Shared memory state. */
+typedef struct
+{
+ pg_atomic_uint64 open_seq;
+} FdSharedData;
+
+static FdSharedData *fd_shared;
/*
* pg_fsync --- do fsync with or without writethrough
@@ -777,6 +786,20 @@ InitFileAccess(void)
on_proc_exit(AtProcExit_Files, 0);
}
+/*
+ * Initialize shared memory state. This is called after shared memory is
+ * ready.
+ */
+void
+FileShmemInit(void)
+{
+ bool found;
+
+ fd_shared = ShmemInitStruct("fd_shared", sizeof(*fd_shared), &found);
+ if (!found)
+ pg_atomic_init_u64(&fd_shared->open_seq, 0);
+}
+
/*
* count_usable_fds --- count how many FDs the system will let us open,
* and estimate how many are already open.
@@ -1085,6 +1108,9 @@ LruInsert(File file)
{
++nfile;
}
+
+ vfdP->open_seq =
+ pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
}
/*
@@ -1121,7 +1147,7 @@ ReleaseLruFile(void)
* Release kernel FDs as needed to get under the max_safe_fds limit.
* After calling this, it's OK to try to open another file.
*/
-static void
+void
ReleaseLruFiles(void)
{
while (nfile + numAllocatedDescs >= max_safe_fds)
@@ -1234,9 +1260,11 @@ FileAccess(File file)
* We now know that the file is open and that it is not the last one
* accessed, so we need to move it to the head of the Lru ring.
*/
-
- Delete(file);
- Insert(file);
+ if (!(VfdCache[file].fdstate & FD_NOT_IN_LRU))
+ {
+ Delete(file);
+ Insert(file);
+ }
}
return 0;
@@ -1354,6 +1382,57 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
+ vfdP->open_seq = pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
+
+ return file;
+}
+
+/*
+ * Open a File for a pre-existing file descriptor.
+ *
+ * Note that these files will not be closed on an LRU basis, therefore the
+ * caller is responsible for limiting the number of open file descriptors.
+ *
+ * The passed in name is purely for informational purposes.
+ */
+File
+FileOpenForFd(int fd, const char *fileName, uint64 open_seq)
+{
+ char *fnamecopy;
+ File file;
+ Vfd *vfdP;
+
+ /*
+ * We need a malloc'd copy of the file name; fail cleanly if no room.
+ */
+ fnamecopy = strdup(fileName);
+ if (fnamecopy == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ file = AllocateVfd();
+ vfdP = &VfdCache[file];
+
+ /* Close excess kernel FDs. */
+ ReleaseLruFiles();
+
+ vfdP->fd = fd;
+ ++nfile;
+
+ DO_DB(elog(LOG, "FileOpenForFd: success %d/%d (%s)",
+ file, fd, fnamecopy));
+
+ /* NB: Explicitly not inserted into LRU! */
+
+ vfdP->fileName = fnamecopy;
+ /* Saved flags are adjusted to be OK for re-opening file */
+ vfdP->fileFlags = 0;
+ vfdP->fileMode = 0;
+ vfdP->fileSize = 0;
+ vfdP->fdstate = FD_NOT_IN_LRU;
+ vfdP->resowner = NULL;
+ vfdP->open_seq = open_seq;
return file;
}
@@ -1704,7 +1783,11 @@ FileClose(File file)
vfdP->fd = VFD_CLOSED;
/* remove the file from the lru ring */
- Delete(file);
+ if (!(vfdP->fdstate & FD_NOT_IN_LRU))
+ {
+ vfdP->fdstate &= ~FD_NOT_IN_LRU;
+ Delete(file);
+ }
}
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
@@ -2073,6 +2156,10 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file))
+ return -1;
+
return VfdCache[file].fd;
}
@@ -2096,6 +2183,17 @@ FileGetRawMode(File file)
return VfdCache[file].fileMode;
}
+/*
+ * Get the opening sequence number of this file. This number is captured
+ * after the file was opened but before anything was written to the file.
+ */
+uint64
+FileGetOpenSeq(File file)
+{
+ Assert(FileIsValid(file));
+ return VfdCache[file].open_seq;
+}
+
/*
* Make room for another allocatedDescs[] array entry if needed and possible.
* Returns true if an array element is available.
@@ -3413,3 +3511,110 @@ MakePGDirectory(const char *directoryName)
{
return mkdir(directoryName, pg_dir_create_mode);
}
+
+#ifndef WIN32
+
+/*
+ * Send data over a unix domain socket, optionally (when fd != -1) including a
+ * file descriptor.
+ */
+ssize_t
+pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
+{
+ ssize_t size;
+ struct msghdr msg = {0};
+ struct iovec iov = {0};
+ /* cmsg header, union for correct alignment */
+ union
+ {
+ struct cmsghdr cmsghdr;
+ char control[CMSG_SPACE(sizeof (int))];
+ } cmsgu;
+ struct cmsghdr *cmsg;
+
+ memset(&cmsgu, 0, sizeof(cmsgu));
+ iov.iov_base = buf;
+ iov.iov_len = buflen;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+
+ if (fd >= 0)
+ {
+ msg.msg_control = cmsgu.control;
+ msg.msg_controllen = sizeof(cmsgu.control);
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg->cmsg_len = CMSG_LEN(sizeof (int));
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+
+ *((int *) CMSG_DATA(cmsg)) = fd;
+ }
+
+ size = sendmsg(sock, &msg, 0);
+
+ /* errors are returned directly */
+ return size;
+}
+
+/*
+ * Receive data from a unix domain socket. If a file descriptor is sent over
+ * the socket, store it in *fd.
+ */
+ssize_t
+pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
+{
+ ssize_t size;
+ struct msghdr msg;
+ struct iovec iov;
+ /* cmsg header, union for correct alignment */
+ union
+ {
+ struct cmsghdr cmsghdr;
+ char control[CMSG_SPACE(sizeof (int))];
+ } cmsgu;
+ struct cmsghdr *cmsg;
+
+ Assert(fd != NULL);
+
+ iov.iov_base = buf;
+ iov.iov_len = bufsize;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+ msg.msg_control = cmsgu.control;
+ msg.msg_controllen = sizeof(cmsgu.control);
+
+ size = recvmsg (sock, &msg, 0);
+
+ if (size < 0)
+ {
+ *fd = -1;
+ return size;
+ }
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+ {
+ if (cmsg->cmsg_level != SOL_SOCKET)
+ elog(FATAL, "unexpected cmsg_level");
+
+ if (cmsg->cmsg_type != SCM_RIGHTS)
+ elog(FATAL, "unexpected cmsg_type");
+
+ *fd = *((int *) CMSG_DATA(cmsg));
+
+ /* FIXME: check / handle additional cmsg structures */
+ }
+ else
+ *fd = -1;
+
+ return size;
+}
+
+#endif
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 7c4ad1c4494..2b47824aab9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -556,7 +556,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* not on extension.)
*/
if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ rel->rd_smgr->smgr_fsm_nblocks == 0)
{
if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
@@ -564,6 +564,9 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
else
rel->rd_smgr->smgr_fsm_nblocks = 0;
}
+ else if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
+ FSM_FORKNUM);
/* Handle requests beyond EOF */
if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c03..efbd25b84da 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
SyncScanShmemInit();
AsyncShmemInit();
BackendRandomShmemInit();
+ FileShmemInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index c129446f9c9..d4f3ad0d44d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -878,6 +878,12 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
{
*handle = PostmasterHandle;
}
+#ifdef WIN32
+ else if (event->events == WL_WIN32_HANDLE)
+ {
+ *handle = *(HANDLE *)event->user_data;
+ }
+#endif
else
{
int flags = FD_CLOSE; /* always check for errors/EOF */
@@ -1453,6 +1459,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
returned_events++;
}
}
+ else if (cur_event->events & WL_WIN32_HANDLE)
+ {
+ occurred_events->events |= WL_WIN32_HANDLE;
+ occurred_events++;
+ returned_events++;
+ }
return returned_events;
}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..c9c4be325ed 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 86013a5c8b2..61fd0adbb8d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -110,6 +97,7 @@ typedef struct _MdfdVec
{
File mdfd_vfd; /* fd number in fd.c's pool */
BlockNumber mdfd_segno; /* segment number, from 0 */
+ uint64 mdfd_dirtied_cycle;
} MdfdVec;
static MemoryContext MdCxt; /* context for all MdfdVec objects */
@@ -134,30 +122,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +151,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +174,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -334,6 +242,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
mdfd = &reln->md_seg_fds[forkNum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
+ mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
}
/*
@@ -388,7 +297,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -448,7 +357,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -540,7 +449,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
if (!skipFsync && !SmgrIsTemp(reln))
- register_dirty_segment(reln, forknum, v);
+ {
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = v->mdfd_segno;
+ v->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ v->mdfd_vfd,
+ v->mdfd_dirtied_cycle);
+ }
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
@@ -600,6 +518,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
mdfd = &reln->md_seg_fds[forknum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
+ mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
@@ -831,7 +750,16 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
if (!skipFsync && !SmgrIsTemp(reln))
- register_dirty_segment(reln, forknum, v);
+ {
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = v->mdfd_segno;
+ v->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ v->mdfd_vfd,
+ v->mdfd_dirtied_cycle);
+ }
}
/*
@@ -1021,660 +949,38 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * This is okay because we unlink each bitmapset from the hashtable
- * entry before scanning it. That means that any incoming fsync
- * requests will be processed now if they reach the table before we
- * begin to scan their fork.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- while ((segno = bms_first_member(requests)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
+ * Compute the path for the specified segment of the relation, writing it
+ * into the caller-supplied buffer (which must be at least MAXPGPATH bytes).
*/
void
-mdpostckpt(void)
+mdpath(const SmgrFileTag *tag, char *out)
{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
+ char *path;
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
+ path = relpathperm(tag->node, tag->forknum);
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
+ if (tag->segno > 0)
+ snprintf(out, MAXPGPATH, "%s.%u", path, tag->segno);
+ else
+ snprintf(out, MAXPGPATH, "%s", path);
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ pfree(path);
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
- /* Temp relations should never be fsync'd */
- Assert(!SmgrIsTemp(reln));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
- ereport(DEBUG1,
- (errmsg("could not forward fsync request because request queue is full")));
-
- if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- FilePathName(seg->mdfd_vfd))));
- }
-}
-
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
+ seg->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ seg->mdfd_vfd,
+ seg->mdfd_dirtied_cycle);
}
/*
@@ -1804,6 +1110,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
v = &reln->md_seg_fds[forknum][segno];
v->mdfd_vfd = fd;
v->mdfd_segno = segno;
+ v->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342ef86a..c36ba4298b7 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -59,9 +60,7 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ void (*smgr_path) (const SmgrFileTag *tag, char *out);
} f_smgr;
@@ -82,9 +81,7 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_path = mdpath
}
};
@@ -104,6 +101,15 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation. If more are added, we'll need to
+ * be able to dispatch based on a file tag.
+ */
+static inline int
+which_for_file_tag(const SmgrFileTag *tag)
+{
+ return 0;
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +124,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -751,50 +759,13 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
/*
- * smgrsync() -- Sync files to disk during checkpoint.
+ * smgrpath() -- Expand a tag to a path.
*/
void
-smgrsync(void)
+smgrpath(const SmgrFileTag *tag, char *out)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ smgrsw[which_for_file_tag(tag)].smgr_path(tag, out);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 00000000000..f4aad18054d
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,803 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and in the startup process.
+ *
+ * Originally this logic lived inside md.c, but it has been generalized for
+ * reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Special values for the segno member of SmgrFileTag.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
+#define FORGET_DATABASE_FSYNC (InvalidBlockNumber - 1)
+#define UNLINK_RELATION_REQUEST (InvalidBlockNumber - 2)
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ SmgrFileTag tag;
+ File file;
+ uint64 cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ uint64 cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static uint32 open_fsync_queue_files = 0;
+static bool sync_in_progress = false;
+static uint64 ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static void syncpass(bool include_current);
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(SmgrFileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbAllFsyncRequests();
+
+ syncpass(false);
+}
+
+/*
+ * Do one pass over the fsync request hashtable and perform the necessary
+ * fsyncs. Increments the sync cycle counter.
+ *
+ * If include_current is true, perform all fsyncs (this is done if too many
+ * files are open); otherwise only perform the fsyncs belonging to the cycle
+ * valid at call time.
+ */
+static void
+syncpass(bool include_current)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use GetCheckpointSyncCycle() to tell old entries apart
+ * from new ones: new ones will have cycle_ctr equal to the incremented
+ * value returned by IncCheckpointSyncCycle().
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ entry->cycle_ctr = GetCheckpointSyncCycle();
+ }
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ IncCheckpointSyncCycle();
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ /*
+ * If we are processing fsync requests because too many file handles
+ * are open, close files regardless of cycle. Otherwise we might find
+ * nothing to close, and we want to make room as quickly as possible so
+ * more requests can be absorbed.
+ */
+ if (!include_current)
+ {
+ /* If the entry is new then don't process it this time. */
+ if (entry->cycle_ctr == GetCheckpointSyncCycle())
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+ }
+
+ /*
+ * If fsync is off then we don't have to bother opening the file at
+ * all. (We delay checking until this point so that changing fsync on
+ * the fly behaves sensibly.)
+ *
+ * XXX: Why is that an important goal? Doesn't give any interesting
+ * guarantees afaict?
+ */
+ if (enableFsync)
+ {
+ File file;
+
+ /*
+ * The fsync table could contain requests to fsync segments that
+ * have been deleted (unlinked) by the time we get to them. That
+ * used to be problematic, but now we have a filehandle to the
+ * deleted file. That means we might fsync an empty file
+ * superfluously, in a relatively tight window, which is
+ * acceptable.
+ */
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ if (entry->file == -1)
+ {
+ /*
+ * If we aren't transferring file descriptors directly to the
+ * checkpointer on this platform, we'll have to convert the
+ * tag to the path and open it (and close it again below).
+ */
+ char path[MAXPGPATH];
+
+ smgrpath(&entry->tag, path);
+ file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ if (file < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\" to fsync: %m",
+ path)));
+ }
+ else
+ {
+ /*
+ * Otherwise, we have kept the file descriptor from the oldest
+ * request for the same tag.
+ */
+ file = entry->file;
+ }
+
+ if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ FilePathName(file))));
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ FilePathName(file),
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+
+ if (entry->file == -1)
+ FileClose(file);
+ }
+
+ if (entry->file >= 0)
+ {
+ /*
+ * Close file. XXX: centralize code.
+ */
+ Assert(open_fsync_queue_files > 0);
+ open_fsync_queue_files--;
+ FileClose(entry->file);
+ entry->file = -1;
+ }
+
+ /* Remove the entry. */
+ if (hash_search(pendingFsyncTable, &entry->tag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingFsyncTable corrupted");
+
+ /*
+ * If in checkpointer, we want to absorb pending requests every so
+ * often to prevent overflow of the fsync request queue. It is
+ * unspecified whether newly-added entries will be visited by
+ * hash_seq_search, but we don't care since we don't need to
+ * process them anyway.
+ */
+ if (absorb_counter-- <= 0)
+ {
+ /*
+ * Don't absorb if too many files are open. This pass will
+ * soon close some, so check again later.
+ */
+ if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+ } /* end loop over hashtable entries */
+
+ /* Flag successful completion of syncpass */
+ sync_in_progress = false;
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ /* XXX: Centralize this condition */
+ if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * FsyncAtCheckpoint() -- Mark a relation segment as needing fsync
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request
+ * to the checkpointer process.
+ */
+uint64
+FsyncAtCheckpoint(const SmgrFileTag *tag, File file, uint64 last_cycle)
+{
+ uint64 cycle;
+
+ pg_memory_barrier();
+ cycle = GetCheckpointSyncCycle();
+
+ /*
+ * For historical reasons, the checkpointer keeps track of the number of
+ * times backends perform writes themselves.
+ */
+ if (!AmBackgroundWriterProcess())
+ CountBackendWrite();
+
+ /* Don't repeatedly register the same segment as dirty. */
+ if (last_cycle == cycle)
+ return cycle;
+
+ if (pendingFsyncTable)
+ {
+ int fd;
+
+ /*
+ * Push it into local pending-ops table.
+ *
+ * We must duplicate the fd, since we can't have fd.c close it behind
+ * our back; that would lose the error reporting guarantees on
+ * Linux. RememberFsyncRequest() will manage the lifetime.
+ */
+ ReleaseLruFiles();
+ fd = dup(FileGetRawDesc(file));
+ if (fd < 0)
+ elog(ERROR, "couldn't dup: %m");
+ RememberFsyncRequest(tag, fd, FileGetOpenSeq(file));
+ }
+ else
+ ForwardFsyncRequest(tag, file);
+
+ return cycle;
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ SmgrFileTag tag;
+
+ tag.node = rnode.node;
+ tag.forknum = MAIN_FORKNUM;
+ tag.segno = UNLINK_RELATION_REQUEST;
+
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+ ForwardFsyncRequest(&tag, -1);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * The range of possible segment numbers is way less than the range of
+ * BlockNumber, so we can reserve high values of segno for special purposes.
+ * We define three:
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers don't exceed INT_MAX.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(const SmgrFileTag *tag, int fd, uint64 open_seq)
+{
+ Assert(pendingFsyncTable);
+
+ if (tag->segno == FORGET_RELATION_FSYNC ||
+ tag->segno == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if ((tag->segno == FORGET_RELATION_FSYNC &&
+ tag->node.dbNode == entry->tag.node.dbNode &&
+ tag->node.relNode == entry->tag.node.relNode &&
+ (tag->forknum == InvalidForkNumber ||
+ tag->forknum == entry->tag.forknum)) ||
+ (tag->segno == FORGET_DATABASE_FSYNC &&
+ tag->node.dbNode == entry->tag.node.dbNode))
+ {
+ if (entry->file != -1)
+ {
+ Assert(open_fsync_queue_files > 0);
+ open_fsync_queue_files--;
+ FileClose(entry->file);
+ }
+ hash_search(pendingFsyncTable, entry, HASH_REMOVE, NULL);
+ }
+ }
+
+ /* Remove unlink requests */
+ if (tag->segno == FORGET_DATABASE_FSYNC)
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (tag->node.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (tag->segno == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(tag->forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = tag->node;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ tag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->file = -1;
+ entry->cycle_ctr = GetCheckpointSyncCycle();
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ if (fd >= 0)
+ {
+ File existing_file;
+ File new_file;
+
+ /*
+ * If we didn't have a file already, or we did have a file but it
+ * was opened later than this one, we'll keep the newly arrived
+ * one.
+ */
+ existing_file = entry->file;
+ if (existing_file == -1 ||
+ FileGetOpenSeq(existing_file) > open_seq)
+ {
+ char path[MAXPGPATH];
+
+ smgrpath(tag, path);
+
+ new_file = FileOpenForFd(fd, path, open_seq);
+ if (new_file < 0)
+ elog(ERROR, "cannot open file");
+ /* caller must have reserved entry */
+ entry->file = new_file;
+
+ if (existing_file != -1)
+ FileClose(existing_file);
+ else
+ open_fsync_queue_files++;
+ }
+ else
+ {
+ /*
+ * The file is already open. We have to keep the older fd,
+ * since errors might only be reported against it, so close
+ * the one we just received.
+ *
+ * XXX: check for errors.
+ */
+ close(fd);
+ }
+
+ FlushFsyncRequestQueueIfNecessary();
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * Flush enough of the fsync request queue to make sure there is room to
+ * keep at least one more file open.
+ */
+bool
+FlushFsyncRequestQueueIfNecessary(void)
+{
+ if (sync_in_progress)
+ return false;
+
+ while (true)
+ {
+ if (open_fsync_queue_files >= ((max_safe_fds * 7) / 10))
+ {
+ elog(DEBUG1,
+ "flush fsync request queue due to %u open files",
+ open_fsync_queue_files);
+ syncpass(true);
+ elog(DEBUG1,
+ "flushed fsync request, now at %u open files",
+ open_fsync_queue_files);
+ }
+ else
+ break;
+ }
+
+ return true;
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ SmgrFileTag tag;
+
+ /* Create a special "forget relation" tag. */
+ tag.node = rnode;
+ tag.forknum = forknum;
+ tag.segno = FORGET_RELATION_FSYNC;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ ForwardFsyncRequest(&tag, -1);
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ SmgrFileTag tag;
+
+ /* Create a special "forget database" tag. */
+ tag.node.dbNode = dbid;
+ tag.node.spcNode = 0;
+ tag.node.relNode = 0;
+ tag.forknum = InvalidForkNumber;
+ tag.segno = FORGET_DATABASE_FSYNC;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ ForwardFsyncRequest(&tag, -1);
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 970c94ee805..32bc91102d7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0327b295da8..a20f951209a 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..137c748dfaf 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 00000000000..252a94f2909
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "storage/smgr.h"
+#include "storage/smgrsync.h"
+
+/*
+ * Control whether we transfer file descriptors to the checkpointer, to
+ * preserve error state on certain kernels. We don't yet have support for
+ * sending files on Windows (it's entirely possible but it's not clear whether
+ * it would actually be useful for anything on that platform). The macro is
+ * here just so that it can be commented out to test the non-fd-passing code
+ * path on Unix systems.
+ */
+#ifndef WIN32
+#define CHECKPOINTER_TRANSFER_FILES
+#endif
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+/* The type used for counting checkpoint cycles. */
+typedef uint32 CheckpointCycle;
+
+/*
+ * A tag identifying a file to be flushed by the checkpointer. This is
+ * convertible to the file's path, but it's convenient to have a small fixed
+ * sized object to use as a hash table key.
+ */
+typedef struct DirtyFileTag
+{
+ RelFileNode node;
+ ForkNumber forknum;
+ BlockNumber segno;
+} DirtyFileTag;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern CheckpointCycle register_dirty_file(const DirtyFileTag *tag,
+ File file,
+ CheckpointCycle last_cycle);
+
+extern void ForwardFsyncRequest(const SmgrFileTag *tag, File fd);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern uint64 GetCheckpointSyncCycle(void);
+extern uint64 IncCheckpointSyncCycle(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index a40d66e8906..8e3bc6edf90 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -44,6 +44,15 @@ extern int postmaster_alive_fds[2];
#define POSTMASTER_FD_OWN 1 /* kept open by postmaster only */
#endif
+#define FSYNC_FD_SUBMIT 0
+#define FSYNC_FD_PROCESS 1
+
+#ifndef WIN32
+extern int fsync_fds[2];
+#else
+extern HANDLE fsyncPipe[2];
+#endif
+
extern PGDLLIMPORT const char *progname;
extern void PostmasterMain(int argc, char *argv[]) pg_attribute_noreturn();
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 1289589a46b..982f380512a 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -61,6 +61,7 @@ extern int max_safe_fds;
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern File FileOpenForFd(int fd, const char *fileName, uint64 open_seq);
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
@@ -74,6 +75,8 @@ extern char *FilePathName(File file);
extern int FileGetRawDesc(File file);
extern int FileGetRawFlags(File file);
extern mode_t FileGetRawMode(File file);
+extern uint64 FileGetOpenSeq(File file);
+extern void FileSetOpenSeq(File file, uint64 seq);
/* Operations used for sharing named temporary files */
extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
@@ -112,6 +115,7 @@ extern int MakePGDirectory(const char *directoryName);
/* Miscellaneous support routines */
extern void InitFileAccess(void);
+extern void FileShmemInit(void);
extern void set_max_safe_fds(void);
extern void closeAllVfds(void);
extern void SetTempTablespaces(Oid *tableSpaces, int numSpaces);
@@ -123,6 +127,7 @@ extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
extern bool looks_like_temp_rel_name(const char *name);
+extern void ReleaseLruFiles(void);
extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
@@ -139,4 +144,10 @@ extern void SyncDataDirectory(void);
#define PG_TEMP_FILES_DIR "pgsql_tmp"
#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+#ifndef WIN32
+/* XXX: This should probably go elsewhere */
+ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
+ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+#endif
+
#endif /* FD_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fd8735b7f5f..a74eedfe4e9 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -128,6 +128,7 @@ typedef struct Latch
#define WL_POSTMASTER_DEATH (1 << 4)
#ifdef WIN32
#define WL_SOCKET_CONNECTED (1 << 5)
+#define WL_WIN32_HANDLE (1 << 6)
#else
/* avoid having to deal with case on platforms not requiring it */
#define WL_SOCKET_CONNECTED WL_SOCKET_WRITEABLE
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc9692..dc22efbe0a8 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -77,6 +77,18 @@ typedef struct SMgrRelationData
typedef SMgrRelationData *SMgrRelation;
+/*
+ * A tag identifying a file to be flushed at the next checkpoint. This is
+ * convertible to the file's path, but it's convenient to have a small fixed
+ * sized object to use as a hash table key.
+ */
+typedef struct SmgrFileTag
+{
+ RelFileNode node;
+ ForkNumber forknum;
+ BlockNumber segno;
+} SmgrFileTag;
+
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
@@ -106,9 +118,7 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern void smgrpath(const SmgrFileTag *tag, char *out);
extern void AtEOXact_SMgr(void);
@@ -134,13 +144,9 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
+extern void mdpath(const SmgrFileTag *tag, char *out);
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
+extern bool FlushFsyncRequestQueueIfNecessary(void);
extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
extern void ForgetDatabaseFsyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 00000000000..f32bb22a7cc
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "postgres.h"
+
+#include "storage/fd.h"
+
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern uint64 FsyncAtCheckpoint(const SmgrFileTag *tag,
+ File file,
+ uint64 last_cycle);
+extern void RememberFsyncRequest(const SmgrFileTag *tag,
+ int fd,
+ uint64 open_seq);
+extern void SetForwardFsyncRequests(void);
+
+
+#endif
--
2.19.1
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
There is one major problem with this patch
If there's only one, you're doing great! Although admittedly this
seems like a big one...
1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.
As you say, this will happen significantly more often with
deduplication. That deduplication logic got added in response to a
real need. Before that, you could cause an individual backend to
start doing its own fsyncs() with something as simple as a bulk load.
The queue would absorb most of them, but not all, and the performance
ramifications were noticeable.
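For context, the pressure valve on master looks roughly like this
(condensed from register_dirty_segment() in md.c; message details elided,
so treat it as a sketch rather than the exact code):

    if (!ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
    {
        /* Queue full and not compactable: do our own fsync. */
        if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
            ereport(data_sync_elevel(ERROR),
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m",
                            FilePathName(seg->mdfd_vfd))));
    }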
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
3. Multiplex the checkpointer's work: Use LWLockConditionalAcquire()
when locking buffers, and if that fails, try to drain the pipe, and
then fall back to a LWLockTimedAcquire(), drain pipe, repeat loop. I
can hear you groan already; that doesn't seem particularly elegant,
and there are portability problems implementing LWLockTimedAcquire():
semtimedop() and sem_timedwait() are not available on all platforms
(eg macOS). Maybe pthread_cond_timedwait() could help (!).
You don't really need to invent LWLockTimedAcquire(). You could just
keep retrying LWLockConditionalAcquire() in a delay loop. I agree
that doesn't seem particularly elegant, though.
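Roughly like this, I mean (just a sketch, with all the buffer-state
details waved away; AbsorbFsyncRequests() is the patch's absorb routine):

    /* Try to lock the buffer without blocking, draining the pipe meanwhile. */
    while (!LWLockConditionalAcquire(BufferDescriptorGetContentLock(bufHdr),
                                     LW_SHARED))
    {
        AbsorbFsyncRequests();  /* unblock senders stuck in SendFsyncRequest() */
        pg_usleep(1000L);       /* 1ms back-off before retrying */
    }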
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there? It seems like if you
wanted to make this work reliably, you'd need to do it the other way
around: have the checkpointer (or some other background process) open
all the FDs, and anybody else who wants to have one open get it from
the checkpointer.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2018-11-12 15:58:41 +1300, Thomas Munro wrote:
There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
a way out of it on either the sending or receiving side. Here are three
ideas:
That's the deadlock I'd mentioned in Pune (?) btw.
1. Go back to the current pressure-valve strategy: make the sending
side perform the fsync(), if it can't immediately write to the pipe.
I don't think that's correct / safe? I've previously wondered whether
there's any way we could delay the write to a point where the buffer is
not locked anymore - as far as I can tell it's actually not required for
correctness that we send the fsync request before unlocking. It's
architecturally a bit dicey tho :(
Greetings,
Andres Freund
Hi,
On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?
I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic? Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.
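Schematically, the window in question is this one in mdwrite() (a sketch,
not the exact code):

    nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ, seekpos,
                       WAIT_EVENT_DATA_FILE_WRITE);
    ...
    register_dirty_segment(reln, forknum, v);   /* vfd still open here */
    /* only after this may the vfd machinery recycle the descriptor */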
It seems like if you wanted to make this work reliably, you'd need to
do it the other way around: have the checkpointer (or some other
background process) open all the FDs, and anybody else who wants to
have one open get it from the checkpointer.
That'd require a process context switch for each FD opened, which seems
clearly like a no-go?
Greetings,
Andres Freund
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?
I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic? Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.
Suppose backend A and backend B open a segment around the same time.
Is it possible that backend A does a write before backend B, but
backend B's copy of the fd reaches the checkpointer before backend A's
copy? If you send the FD to the checkpointer before writing anything
then I think it's fine, but if you write first and then send the FD to
the checkpointer I don't see what guarantees the ordering.
It seems like if you wanted to make this work reliably, you'd need to
do it the other way around: have the checkpointer (or some other
background process) open all the FDs, and anybody else who wants to
have one open get it from the checkpointer.
That'd require a process context switch for each FD opened, which seems
clearly like a no-go?
I don't know how bad that would be. But hey, no cost is too great to
pay as a workaround for insane kernel semantics, right?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
(Replies to a couple of different messages below)
On Wed, Nov 14, 2018 at 6:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Sun, Nov 11, 2018 at 9:59 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
There is one major problem with this patch
If there's only one, you're doing great! Although admittedly this
seems like a big one...
Make that two.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
Admittedly it adds a whole extra rabbit hole to this rabbit hole,
which was itself a diversion from my goal of refactoring the syncing
machinery to support undo logs. But the other two ideas seem to suck
and/or have correctness issues.
On Wed, Nov 14, 2018 at 7:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 13, 2018 at 1:07 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-13 12:04:23 -0500, Robert Haas wrote:
I still feel like this whole pass-the-fds-to-the-checkpointer thing is
a bit of a fool's errand, though. I mean, there's no guarantee that
the first FD that gets passed to the checkpointer is the first one
opened, or even the first one written, is there?
I'm not sure I understand the danger you're seeing here. It doesn't have
to be the first fd opened, it has to be an fd that's older than all the
writes that we need to ensure made it to disk. And that ought to be
guaranteed by the logic? Between the FileWrite() and the
register_dirty_segment() (and other relevant paths) the FD cannot be
closed.
Suppose backend A and backend B open a segment around the same time.
Is it possible that backend A does a write before backend B, but
backend B's copy of the fd reaches the checkpointer before backend A's
copy? If you send the FD to the checkpointer before writing anything
then I think it's fine, but if you write first and then send the FD to
the checkpointer I don't see what guarantees the ordering.
I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).
In your example, both fds arrive in the checkpointer in some order,
and it will keep the one with the older sequence number and close the
other one. This sorting of all interesting fds will be forced before
the checkpoint completes by AbsorbFsyncRequests(), which drains all
messages from the pipe until it sees a message for the next checkpoint
cycle.
Hmm, I think there is a flaw in the plan here though. Messages for
different checkpoint cycles race to enter the pipe around the time the
cycle counter is bumped, so you could have a message for n hiding
behind a message for n + 1 and not drain enough; I'm not sure and need
to look at something else today, but I see a couple of potential
solutions to that which I will mull over, based on either a new shared
counter increment or a special pipe message written after BufferSync()
by the bgwriter (if we go for idea #2; Andres had something similar in
the original prototype but it could self-deadlock). I need to figure
out if that is a suitable barrier due to buffer interlocking.
It seems like if you wanted to make this work reliably, you'd need to
do it the other way around: have the checkpointer (or some other
background process) open all the FDs, and anybody else who wants to
have one open get it from the checkpointer.
That'd require a process context switch for each FD opened, which seems
clearly like a no-go?
I don't know how bad that would be. But hey, no cost is too great to
pay as a workaround for insane kernel semantics, right?
Yeah, seems extremely expensive and unnecessary. It seems sufficient
to track the global opening order... or at least a proxy that
identifies the fd that performed the oldest write. Which I believe
this patch is doing.
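Schematically the rule is just this (a sketch; shared->open_seq_counter
is a made-up name for the patch's shared counter):

    fd = open(path, O_RDWR | PG_BINARY, 0);
    if (fd >= 0)
    {
        /* Capture a sequence number before any write goes through fd. */
        open_seq = pg_atomic_fetch_add_u64(&shared->open_seq_counter, 1);
    }
    ...
    pwrite(fd, buffer, BLCKSZ, offset);   /* always after open_seq is assigned */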
--
Thomas Munro
http://www.enterprisedb.com
On Wed, 14 Nov 2018 at 00:44, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
Here is a rebased version of the patch, post pread()/pwrite(). I have
also rewritten the commit message to try to explain the rationale
concisely, instead of requiring the reader to consult multiple
discussions that jump between lengthy email threads to understand the
key points.
Thank you for working on this patch!
There is one major problem with this patch: BufferSync(), run in the
checkpointer, can deadlock against a backend that holds a buffer lock
and is blocked in SendFsyncRequest(). To break this deadlock, we need
a way out of it on either the sending or receiving side.
Or introduce a third side, but I'm not sure how appropriate that would be here.
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
I also think it sounds better than the other options (although probably that's
partially because these options were formulated while already having some bias
towards one of the solutions).
2. Offload the BufferSync() work to bgwriter, so the checkpointer can
keep draining the pipe. Communication between checkpointer and
bgwriter can be fairly easily multiplexed with the pipe draining work.
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
I had the same first thought, but after reading the corresponding mailing
thread I got the impression that the purpose of this change was rather
technical (to split work between different processes for performance
reasons) and not really about the division of labor - am I wrong here?
While testing this patch with frequent checkpoints I've stumbled upon an
interesting error, which happened right after I had finished one test:
TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
2018-11-13 22:06:29.773 CET [7886] LOG: checkpointer process (PID
7934) was terminated by signal 6: Aborted
2018-11-13 22:06:29.773 CET [7886] LOG: terminating any other active
server processes
2018-11-13 22:06:29.773 CET [7937] WARNING: terminating connection
because of crash of another server process
2018-11-13 22:06:29.773 CET [7937] DETAIL: The postmaster has
commanded this server process to roll back the current transaction and
exit, because another server process exited abnormally and possibly
corrupted shared memory.
2018-11-13 22:06:29.773 CET [7937] HINT: In a moment you should be
able to reconnect to the database and repeat your command.
2018-11-13 22:06:29.778 CET [7886] LOG: all server processes
terminated; reinitializing
I assume it shouldn't be like that? I haven't investigated deeply yet, but
the backtrace looks like:
bt
#0 0x00007f7ee7a3af00 in raise () from /lib64/libc.so.6
#1 0x00007f7ee7a3ca57 in abort () from /lib64/libc.so.6
#2 0x0000560e89d1858e in ExceptionalCondition
(conditionName=conditionName@entry=0x560e89eca333 "!(rc > 0)",
errorType=errorType@entry=0x560e89d6cec8 "FailedAssertion",
fileName=fileName@entry=0x560e89eca2c9 "checkpointer.c",
lineNumber=lineNumber@entry=574) at assert.c:54
#3 0x0000560e89b5e3ff in CheckpointerMain () at checkpointer.c:574
#4 0x0000560e8995ef9e in AuxiliaryProcessMain (argc=argc@entry=2,
argv=argv@entry=0x7ffe05c32f60) at bootstrap.c:460
#5 0x0000560e89b69c55 in StartChildProcess
(type=type@entry=CheckpointerProcess) at postmaster.c:5369
#6 0x0000560e89b6af15 in reaper (postgres_signal_arg=<optimized out>)
at postmaster.c:2916
#7 <signal handler called>
#8 0x00007f7ee7afe00b in select () from /lib64/libc.so.6
#9 0x0000560e89b6bd20 in ServerLoop () at postmaster.c:1679
#10 0x0000560e89b6d1bc in PostmasterMain (argc=3, argv=<optimized
out>) at postmaster.c:1388
#11 0x0000560e89acadc6 in main (argc=3, argv=0x560e8ad42c00) at main.c:228
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
I don't know. I guess the fact that the checkpointer is still
performing the fsyncs is probably a key point. I mean, in the old
division of labor, fsyncs could interrupt the background writing that
was supposed to be happening.
I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).
But how do you make reading that counter atomic with the open() itself?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I'm not sure if it matters whether we send the fd before or after the
write, but we still need some kind of global ordering of fds that can
order a given fd with respect to writes in other processes, so the
patch introduces a global shared counter captured immediately after
open() (including when reopened in the vfd machinery).
But how do you make reading that counter atomic with the open() itself?
I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?
Greetings,
Andres Freund
On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
But how do you make reading that counter atomic with the open() itself?
I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?
If the goal is to have the FD that is opened first end up in the
checkpointer's table, grabbing a counter backwards does not achieve
it, because there's a race.
S1: open FD
S2: open FD
S2: local_counter = shared_counter++
S1: local_counter = shared_counter++
Now S1 was opened first but has a higher shared counter value than S2
which was opened later. Does that matter? Beats me! I just work
here...
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
While testing this patch with frequent checkpoints I've stumbled upon an
interesting error, which happened right after I had finished one test:
TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
Thanks for testing! Yeah, that's:
+ rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
+ Assert(rc > 0);
I got confused about the API. If there is a timeout, you get rc == 0,
but I think I was expecting rc == 1 with event.events == WL_TIMEOUT. Oops.
I will fix that when I post a new experimental version that uses the
bgworker as discussed, and we can try to figure out if that design
will fly.
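For the record, the fix is just to treat rc == 0 as the timeout case, per
the WaitEventSetWait() contract (it returns the number of events reported):

    rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
    if (rc == 0)
    {
        /* Timed out; 'event' was not filled in. */
    }
    else if (event.events == WL_POSTMASTER_DEATH)
        exit(1);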
--
Thomas Munro
http://www.enterprisedb.com
On Sat, Nov 17, 2018 at 4:05 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Nov 14, 2018 at 4:49 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-11-14 16:36:49 -0500, Robert Haas wrote:
But how do you make reading that counter atomic with the open() itself?
I don't see why it has to be. As long as the "fd generation" assignment
happens before fsync (and writes secondarily), there ought not to be any
further need for synchronicity?
If the goal is to have the FD that is opened first end up in the
checkpointer's table, grabbing a counter backwards does not achieve
it, because there's a race.
S1: open FD
S2: open FD
S2: local_counter = shared_counter++
S1: local_counter = shared_counter++
Now S1 was opened first but has a higher shared counter value than S2
which was opened later. Does that matter? Beats me! I just work
here...
It's not important for the sequence numbers to match the opening order
exactly (that'd work too but be expensive to orchestrate). It's
important for the sequence numbers to be assigned before each backend
does its first pwrite(). That gives us the following interleavings to
worry about:
S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S1: pwrite()
S2: pwrite()
S1: local_counter = shared_counter++
S2: local_counter = shared_counter++
S2: pwrite()
S1: pwrite()
S1: local_counter = shared_counter++
S1: pwrite()
S2: local_counter = shared_counter++
S2: pwrite()
... plus the same interleavings with S1 and S2 labels swapped. In all
6 orderings, the fd that has the lowest sequence number can see errors
relating to write-back of kernel buffers dirtied by both pwrite()
calls.
Or to put it another way, you can't be given a lower sequence number
than another process that has already written, because that other
process must have been given a sequence number before it wrote.
--
Thomas Munro
http://www.enterprisedb.com
On Fri, Nov 16, 2018 at 5:38 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
Or to put it another way, you can't be given a lower sequence number
than another process that has already written, because that other
process must have been given a sequence number before it wrote.
OK, that makes sense.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Nov 17, 2018 at 10:53 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
On Thu, Nov 15, 2018 at 5:09 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
While testing this patch with frequent checkpoints I've stumbled upon an
interesting error, which happened right after I had finished one test:
TRAP: FailedAssertion("!(rc > 0)", File: "checkpointer.c", Line: 574)
Fixed in the 0001 patch (and a similar problem in the WIN32 branch).
On Thu, Nov 15, 2018 at 10:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Nov 13, 2018 at 6:44 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
That sounds a little like you are proposing to go back to the way
things were before 806a2aee3791244bf0f916729bfdb5489936e068 (and,
belatedly, bf405ba8e460051e715d0a91442b579e590328ce) although I guess
the division of labor wouldn't be quite the same.
But is there an argument against it? The checkpointer would still be
creating checkpoints including running fsync, but the background
writer would be, erm, writing, erm, in the background.
I don't know. I guess the fact that the checkpointer is still
performing the fsyncs is probably a key point. I mean, in the old
division of labor, fsyncs could interrupt the background writing that
was supposed to be happening.
Robert explained off-list that BgBufferSync() and BufferSync() have
rather different goals, and performing them in the same loop without
major reengineering to merge their logic would probably not work out
well. So I'm abandoning that plan for now (though it could perhaps be
interesting if done right).
I do have a new plan though...
On Wed, Nov 14, 2018 at 7:01 AM Andres Freund <andres@anarazel.de> wrote:
... I've previously wondered whether
there's any way we could delay the write to a point where the buffer is
not locked anymore - as far as I can tell it's actually not required for
correctness that we send the fsync request before unlocking. It's
architecturally a bit dicey tho :(
... and it's basically what Andres said there ^^^.
The specific hazard I wondered about is when a checkpoint begins after
BufferAlloc() calls pwrite() but before it calls sendto(), so that we
fail to fsync() a file that was modified before the checkpoint LSN.
But, AFAICS, assuming we call sendto() before we update the buffer
header, there are only two possibilities from the point of view of
BufferAlloc():
1. The checkpointer's BufferSync() loop arrives before we update the
buffer header, so it sees the buffer as dirty, writes it out (again),
remembers that the segment is dirty, and then when we eventually get
the buffer header lock we see that it's not dirty anymore and we just
skip the buffer.
2. The checkpointer's BufferSync() loop arrives after we updated the
buffer header, so it sees it as invalid (or some later state), which
means that we have already called sendto() (before we updated the
header).
Either way, the checkpointer finishes up calling fsync() before the
checkpoint completes, as it should, and the worst that can happen due
to bad timing is a harmless double pwrite().
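So the ordering the backend must preserve is, in outline (a condensation
of the FlushBuffer()/BufferAlloc() path, not the literal patch text):

    smgrwrite(reln, forknum, blocknum, bufToWrite, false);
                        /* pwrite()s the page; md.c forwards the fsync
                         * request (sendto()) before this returns */
    TerminateBufferIO(buf, true, 0);    /* only now clear BM_DIRTY etc. */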
I noticed a subtle problem though. Suppose we have case 2 above.
After BufferSync() returns in the checkpointer, our backend has called
sendto() to register a dirty file. In v2 the checkpointer runs
AbsorbAllFsyncRequests() to drain the pipe until it sees a message for
the current cycle (that is, it absorbs messages for the previous
cycle). That's obviously not good enough, since backends race to call
sendto() and a message for cycle n - 1 might be hiding behind a
message for cycle n. So I propose to drain the pipe until it is empty
or we see a message for cycle n + 1 (where n is the current cycle
before we start draining, meaning that we ran out of fds and forced a
new cycle in FlushFsyncRequestQueueIfNecessary()). I think that
works, because although we can't be sure that we'll receive all
messages for n - 1 before we receive a message for n due to races on
the insert side, we *can* be certain that we'll receive all messages
for n - 1 before we receive a message for n + 1, because we know that
they were already in the pipe before we began. In the happy case, our
system never runs out of fds so the pipe will eventually be empty,
since backends send at most one message per cycle per file and the
number of files is finite, and in the sad case, there are too many
dirty files per cycle, so we keep creating new cycles while absorbing,
but again the loop is bounded because we know that seeing n + 1 is
enough for our purpose (which is to fsync all files that were already
mentioned in messages sent to the pipe before we started our loop).
That's implemented in the 0002 patch, separated for ease of review.
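In outline, the absorb loop becomes (hypothetical helper and field names;
the real code is AbsorbAllFsyncRequests() in the 0002 patch):

    uint64 n = GetCheckpointSyncCycle();    /* cycle when draining begins */

    for (;;)
    {
        if (!ReceiveOneFsyncRequest(&req, &fd)) /* hypothetical: pipe empty */
            break;
        RememberFsyncRequest(&req.tag, fd, req.open_seq);
        if (req.cycle > n)      /* a message for cycle n + 1: done */
            break;
    }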
The above theories cover BufferAlloc()'s interaction with a
checkpoint, and that seems to be the main story. I'm not entirely
sure about the other callers of FlushBuffer() or FlushOneBuffer() (eg
special case init fork stuff), but I've run out of brain power for now
and wanted to post an update.
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Refactor-the-checkpointer-s-data-sync-request-que-v3.patch (application/octet-stream)
From 317d0d8deac56c65a816f5d74ff3181006b87614 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 15 Oct 2018 22:48:05 +1300
Subject: [PATCH 1/2] Refactor the checkpointer's data sync request queue.
1. Decouple the checkpoint queue machinery from md.c, so that
future SMGR implementations can also use it to have arbitrary
files flushed to disk as part of the next checkpoint.
2. Keep file descriptors open to avoid losing errors on some OSes.
Although commit 9ccdd7f6 made sure that we don't retry after a failed
fsync(), on operating systems that eject dirty buffers on I/O errors,
there is still a small chance that errors could be forgotten while
a file is closed. Under memory pressure, the inode holding the error
state could be evicted.
Change to a model where file descriptors are sent to the
checkpointer via the ancillary data mechanism of Unix domain sockets.
One file descriptor for each given file is held open, to prevent
error state amnesia.
To defend against an even less likely hazard on Linux, hold onto the
file descriptor that performed the oldest write. Assign a
monotonically increasing sequence number to all file descriptors
after they are opened and before they have been used to write.
This way, an external process such as a backup script can't consume
an error that we need to see. This logic works for recent Linux
kernels with errseq_t-based error tracking and a "seen" flag.
Other operating systems with a simple error flag that is cleared by
the first observer combined with a policy of dropping dirty buffers
on write-back failure probably have the same problem with external
processes, but there doesn't seem to be anything we can do about
that.
On Windows, a pipe is the most natural replacement for a Unix domain
socket, but unfortunately pipes don't support multiplexing via
WSAEventSelect(), as used by our WaitEventSet machinery. So use
asynchronous I/O, and add the ability to wait for I/O completion to
WaitEventSet. A new wait event flag WL_WIN32_HANDLE is provided
on Windows only, and used to wait for asynchronous read and write
operations over the checkpointer pipe. For now file descriptors are
not transferred via the pipe on Windows (but could be in a future
patch; we don't currently have any reason to think that a similar
hazard does or does not exist on Windows, and if so, that this
technique would fix it, though it probably wouldn't hurt).
The fd-passing concept was originally proposed and prototyped by
Andres. Here it is extended, made portable and combined with the
refactoring in point 1 since both things needed to rewrite the same
code.
Author: Andres Freund and Thomas Munro
Reviewed-by: Thomas Munro, Dmitry Dolgov
Discussion: https://postgr.es/m/CAEepm%3D2gTANm%3De3ARnJT%3Dn0h8hf88wqmaZxk0JYkxw%2Bb21fNrw%40mail.gmail.com
Discussion: https://postgr.es/m/20180427222842.in2e4mibx45zdth5%40alap3.anarazel.de
Discussion: https://postgr.es/m/CAMsr+YHh+5Oq4xziwwoEfhoTZgr07vdGG+hu=1adXx59aTeaoQ@mail.gmail.com
---
src/backend/access/transam/xlog.c | 9 +-
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/commands/dbcommands.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 542 +++++++++------
src/backend/postmaster/postmaster.c | 123 +++-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/file/fd.c | 217 +++++-
src/backend/storage/freespace/freespace.c | 5 +-
src/backend/storage/ipc/ipci.c | 2 +
src/backend/storage/ipc/latch.c | 12 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 804 ++--------------------
src/backend/storage/smgr/smgr.c | 63 +-
src/backend/storage/smgr/smgrsync.c | 803 +++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 71 ++
src/include/postmaster/postmaster.h | 9 +
src/include/storage/fd.h | 11 +
src/include/storage/latch.h | 1 +
src/include/storage/smgr.h | 24 +-
src/include/storage/smgrsync.h | 37 +
25 files changed, 1709 insertions(+), 1061 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 80616c5f1e7..4b805e7f66c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/backend_random.h"
#include "utils/builtins.h"
@@ -8757,8 +8759,10 @@ CreateCheckPoint(int flags)
* Note: because it is possible for log_checkpoints to change while a
* checkpoint proceeds, we always accumulate stats, even if
* log_checkpoints is currently off.
+ *
+ * Note #2: this is reset at the end of the checkpoint, not here, because
+ * we might have to fsync before getting here (see smgrsync()).
*/
- MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
CheckpointStats.ckpt_start_t = GetCurrentTimestamp();
/*
@@ -9121,6 +9125,9 @@ CreateCheckPoint(int flags)
CheckpointStats.ckpt_segs_recycled);
LWLockRelease(CheckpointLock);
+
+ /* reset stats */
+ MemSet(&CheckpointStats, 0, sizeof(CheckpointStats));
}
/*
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 7caab64ce78..7863bd7783d 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f640f469729..b59414b3350 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4a714f6e2be..aa76b8d25ec 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 54a042843da..6110bc98aed 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 9eac86b554b..892654dc053 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -46,7 +46,10 @@
#include "libpq/pqsignal.h"
#include "miscadmin.h"
#include "pgstat.h"
+#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -101,19 +104,21 @@
*
* The requests array holds fsync requests sent by backends and not yet
* absorbed by the checkpointer.
- *
- * Unlike the checkpoint fields, num_backend_writes, num_backend_fsync, and
- * the requests fields are protected by CheckpointerCommLock.
*----------
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
+ uint32 type;
+ SmgrFileTag tag;
+ bool contains_fd;
+ int ckpt_started;
+ uint64 open_seq;
/* might add a real request-type field later; not needed yet */
} CheckpointerRequest;
+#define CKPT_REQUEST_RNODE 1
+#define CKPT_REQUEST_SYN 2
+
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -126,12 +131,9 @@ typedef struct
int ckpt_flags; /* checkpoint flags, as defined in xlog.h */
- uint32 num_backend_writes; /* counts user backend buffer writes */
- uint32 num_backend_fsync; /* counts user backend fsync calls */
-
- int num_requests; /* current # of requests */
- int max_requests; /* allocated array size */
- CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
+ pg_atomic_uint32 num_backend_writes; /* counts user backend buffer writes */
+ pg_atomic_uint32 num_backend_fsync; /* counts user backend fsync calls */
+ pg_atomic_uint64 ckpt_cycle; /* cycle */
} CheckpointerShmemStruct;
static CheckpointerShmemStruct *CheckpointerShmem;
@@ -171,8 +173,9 @@ static pg_time_t last_xlog_switch_time;
static void CheckArchiveTimeout(void);
static bool IsCheckpointOnSchedule(double progress);
static bool ImmediateCheckpointRequested(void);
-static bool CompactCheckpointerRequestQueue(void);
static void UpdateSharedMemoryConfig(void);
+static void SendFsyncRequest(CheckpointerRequest *request, int fd);
+static bool AbsorbFsyncRequest(bool stop_at_current_cycle);
/* Signal handlers */
@@ -182,6 +185,11 @@ static void ReqCheckpointHandler(SIGNAL_ARGS);
static void chkpt_sigusr1_handler(SIGNAL_ARGS);
static void ReqShutdownHandler(SIGNAL_ARGS);
+#ifdef WIN32
+/* State used to track in-progress asynchronous fsync pipe reads. */
+static OVERLAPPED absorb_overlapped;
+static HANDLE *absorb_read_in_progress;
+#endif
/*
* Main entry point for checkpointer process
@@ -194,6 +202,7 @@ CheckpointerMain(void)
{
sigjmp_buf local_sigjmp_buf;
MemoryContext checkpointer_context;
+ WaitEventSet *wes;
CheckpointerShmem->checkpointer_pid = MyProcPid;
@@ -330,6 +339,21 @@ CheckpointerMain(void)
*/
ProcGlobal->checkpointerLatch = &MyProc->procLatch;
+ /* Create reusable WaitEventSet. */
+ wes = CreateWaitEventSet(TopMemoryContext, 3);
+ AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+ NULL);
+ AddWaitEventToSet(wes, WL_LATCH_SET, PGINVALID_SOCKET, MyLatch, NULL);
+#ifndef WIN32
+ AddWaitEventToSet(wes, WL_SOCKET_READABLE, fsync_fds[FSYNC_FD_PROCESS],
+ NULL, NULL);
+#else
+ absorb_overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE,
+ "fsync pipe read completion");
+ AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+ &absorb_overlapped.hEvent);
+#endif
+
/*
* Loop forever
*/
@@ -341,6 +365,7 @@ CheckpointerMain(void)
int elapsed_secs;
int cur_timeout;
int rc;
+ WaitEvent event;
/* Clear any already-pending wakeups */
ResetLatch(MyLatch);
@@ -541,16 +566,13 @@ CheckpointerMain(void)
cur_timeout = Min(cur_timeout, XLogArchiveTimeout - elapsed_secs);
}
- rc = WaitLatch(MyLatch,
- WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
- cur_timeout * 1000L /* convert to ms */ ,
- WAIT_EVENT_CHECKPOINTER_MAIN);
+ rc = WaitEventSetWait(wes, cur_timeout * 1000, &event, 1, 0);
/*
* Emergency bailout if postmaster has died. This is to avoid the
* necessity for manual cleanup of all postmaster children.
*/
- if (rc & WL_POSTMASTER_DEATH)
+ if (rc == 1 && event.events == WL_POSTMASTER_DEATH)
exit(1);
}
}
@@ -886,16 +908,7 @@ ReqShutdownHandler(SIGNAL_ARGS)
Size
CheckpointerShmemSize(void)
{
- Size size;
-
- /*
- * Currently, the size of the requests[] array is arbitrarily set equal to
- * NBuffers. This may prove too large or small ...
- */
- size = offsetof(CheckpointerShmemStruct, requests);
- size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
-
- return size;
+ return sizeof(CheckpointerShmemStruct);
}
/*
@@ -916,13 +929,13 @@ CheckpointerShmemInit(void)
if (!found)
{
/*
- * First time through, so initialize. Note that we zero the whole
- * requests array; this is so that CompactCheckpointerRequestQueue can
- * assume that any pad bytes in the request structs are zeroes.
+ * First time through, so initialize.
*/
MemSet(CheckpointerShmem, 0, size);
SpinLockInit(&CheckpointerShmem->ckpt_lck);
- CheckpointerShmem->max_requests = NBuffers;
+ pg_atomic_init_u64(&CheckpointerShmem->ckpt_cycle, 0);
+ pg_atomic_init_u32(&CheckpointerShmem->num_backend_writes, 0);
+ pg_atomic_init_u32(&CheckpointerShmem->num_backend_fsync, 0);
}
}
@@ -1098,181 +1111,84 @@ RequestCheckpoint(int flags)
* is theoretically possible a backend fsync might still be necessary, if
* the queue is full and contains no duplicate entries. In that case, we
* let the backend know by returning false.
+ *
+ * We add the cycle counter to the message. That is an unsynchronized read
+ * of the shared memory counter, but it doesn't matter if it is arbitrarily
+ * old since it is only used to limit unnecessary extra queue draining in
+ * AbsorbAllFsyncRequests().
*/
-bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+void
+ForwardFsyncRequest(const SmgrFileTag *tag, File file)
{
- CheckpointerRequest *request;
- bool too_full;
+ CheckpointerRequest request = {0};
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
- /* Count all backend writes regardless of if they fit in the queue */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_writes++;
+ request.type = CKPT_REQUEST_RNODE;
+ request.tag = *tag;
+#ifdef CHECKPOINTER_TRANSFER_FILES
+ request.contains_fd = file != -1;
+#else
+ request.contains_fd = false;
+#endif
/*
- * If the checkpointer isn't running or the request queue is full, the
- * backend will have to perform its own fsync request. But before forcing
- * that to happen, we can try to compact the request queue.
+ * Tell the checkpointer the sequence number of the most recent open, so
+ * that it can be sure to hold the older file descriptor.
*/
- if (CheckpointerShmem->checkpointer_pid == 0 ||
- (CheckpointerShmem->num_requests >= CheckpointerShmem->max_requests &&
- !CompactCheckpointerRequestQueue()))
- {
- /*
- * Count the subset of writes where backends have to do their own
- * fsync
- */
- if (!AmBackgroundWriterProcess())
- CheckpointerShmem->num_backend_fsync++;
- LWLockRelease(CheckpointerCommLock);
- return false;
- }
+ request.open_seq = request.contains_fd ? FileGetOpenSeq(file) : (uint64) -1;
- /* OK, insert request */
- request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
-
- /* If queue is more than half full, nudge the checkpointer to empty it */
- too_full = (CheckpointerShmem->num_requests >=
- CheckpointerShmem->max_requests / 2);
-
- LWLockRelease(CheckpointerCommLock);
-
- /* ... but not till after we release the lock */
- if (too_full && ProcGlobal->checkpointerLatch)
- SetLatch(ProcGlobal->checkpointerLatch);
+ /*
+ * We read ckpt_started without synchronization. It is used to prevent
+ * AbsorbAllFsyncRequests() from absorbing requests sent after a
+ * checkpoint began. A slightly out-of-date value here will only cause
+ * it to do a little bit more work than strictly necessary, but that's
+ * OK.
+ */
+ request.ckpt_started = CheckpointerShmem->ckpt_started;
- return true;
+ SendFsyncRequest(&request,
+ request.contains_fd ? FileGetRawDesc(file) : -1);
}
/*
- * CompactCheckpointerRequestQueue
- * Remove duplicates from the request queue to avoid backend fsyncs.
- * Returns "true" if any entries were removed.
- *
- * Although a full fsync request queue is not common, it can lead to severe
- * performance problems when it does happen. So far, this situation has
- * only been observed to occur when the system is under heavy write load,
- * and especially during the "sync" phase of a checkpoint. Without this
- * logic, each backend begins doing an fsync for every block written, which
- * gets very expensive and can slow down the whole system.
+ * AbsorbFsyncRequests
+ * Retrieve queued fsync requests and pass them to local smgr. Stop when
+ * resources would be exhausted by absorbing more.
*
- * Trying to do this every time the queue is full could lose if there
- * aren't any removable entries. But that should be vanishingly rare in
- * practice: there's one queue entry per shared buffer.
+ * This is exported because we want to continue accepting requests during
+ * smgrsync().
*/
-static bool
-CompactCheckpointerRequestQueue(void)
+void
+AbsorbFsyncRequests(void)
{
- struct CheckpointerSlotMapping
- {
- CheckpointerRequest request;
- int slot;
- };
-
- int n,
- preserve_count;
- int num_skipped = 0;
- HASHCTL ctl;
- HTAB *htab;
- bool *skip_slot;
-
- /* must hold CheckpointerCommLock in exclusive mode */
- Assert(LWLockHeldByMe(CheckpointerCommLock));
-
- /* Initialize skip_slot array */
- skip_slot = palloc0(sizeof(bool) * CheckpointerShmem->num_requests);
-
- /* Initialize temporary hash table */
- MemSet(&ctl, 0, sizeof(ctl));
- ctl.keysize = sizeof(CheckpointerRequest);
- ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
- ctl.hcxt = CurrentMemoryContext;
-
- htab = hash_create("CompactCheckpointerRequestQueue",
- CheckpointerShmem->num_requests,
- &ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
-
- /*
- * The basic idea here is that a request can be skipped if it's followed
- * by a later, identical request. It might seem more sensible to work
- * backwards from the end of the queue and check whether a request is
- * *preceded* by an earlier, identical request, in the hopes of doing less
- * copying. But that might change the semantics, if there's an
- * intervening FORGET_RELATION_FSYNC or FORGET_DATABASE_FSYNC request, so
- * we do it this way. It would be possible to be even smarter if we made
- * the code below understand the specific semantics of such requests (it
- * could blow away preceding entries that would end up being canceled
- * anyhow), but it's not clear that the extra complexity would buy us
- * anything.
- */
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- CheckpointerRequest *request;
- struct CheckpointerSlotMapping *slotmap;
- bool found;
-
- /*
- * We use the request struct directly as a hashtable key. This
- * assumes that any padding bytes in the structs are consistently the
- * same, which should be okay because we zeroed them in
- * CheckpointerShmemInit. Note also that RelFileNode had better
- * contain no pad bytes.
- */
- request = &CheckpointerShmem->requests[n];
- slotmap = hash_search(htab, request, HASH_ENTER, &found);
- if (found)
- {
- /* Duplicate, so mark the previous occurrence as skippable */
- skip_slot[slotmap->slot] = true;
- num_skipped++;
- }
- /* Remember slot containing latest occurrence of this request value */
- slotmap->slot = n;
- }
+ if (!AmCheckpointerProcess())
+ return;
- /* Done with the hash table. */
- hash_destroy(htab);
+ /* Transfer stats counts into pending pgstats message */
+ BgWriterStats.m_buf_written_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+ BgWriterStats.m_buf_fsync_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
- /* If no duplicates, we're out of luck. */
- if (!num_skipped)
+ while (true)
{
- pfree(skip_slot);
- return false;
- }
+ if (!FlushFsyncRequestQueueIfNecessary())
+ break;
- /* We found some duplicates; remove them. */
- preserve_count = 0;
- for (n = 0; n < CheckpointerShmem->num_requests; n++)
- {
- if (skip_slot[n])
- continue;
- CheckpointerShmem->requests[preserve_count++] = CheckpointerShmem->requests[n];
+ if (!AbsorbFsyncRequest(false))
+ break;
}
- ereport(DEBUG1,
- (errmsg("compacted fsync request queue from %d entries to %d entries",
- CheckpointerShmem->num_requests, preserve_count)));
- CheckpointerShmem->num_requests = preserve_count;
-
- /* Cleanup. */
- pfree(skip_slot);
- return true;
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbAllFsyncRequests
+ * Retrieve all already pending fsync requests and pass them to local
+ * smgr.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1280,54 +1196,121 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbAllFsyncRequests(void)
{
- CheckpointerRequest *requests = NULL;
- CheckpointerRequest *request;
- int n;
-
if (!AmCheckpointerProcess())
return;
- LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
-
/* Transfer stats counts into pending pgstats message */
- BgWriterStats.m_buf_written_backend += CheckpointerShmem->num_backend_writes;
- BgWriterStats.m_buf_fsync_backend += CheckpointerShmem->num_backend_fsync;
-
- CheckpointerShmem->num_backend_writes = 0;
- CheckpointerShmem->num_backend_fsync = 0;
+ BgWriterStats.m_buf_written_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_writes, 0);
+ BgWriterStats.m_buf_fsync_backend +=
+ pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
- /*
- * We try to avoid holding the lock for a long time by copying the request
- * array, and processing the requests after releasing the lock.
- *
- * Once we have cleared the requests from shared memory, we have to PANIC
- * if we then fail to absorb them (eg, because our hashtable runs out of
- * memory). This is because the system cannot run safely if we are unable
- * to fsync what we have been told to fsync. Fortunately, the hashtable
- * is so small that the problem is quite unlikely to arise in practice.
- */
- n = CheckpointerShmem->num_requests;
- if (n > 0)
+ for (;;)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+ if (!FlushFsyncRequestQueueIfNecessary())
+ elog(FATAL, "may not happen");
+
+ if (!AbsorbFsyncRequest(true))
+ break;
}
+}
+
+/*
+ * AbsorbFsyncRequest
+ * Retrieve one queued fsync request and pass it to local smgr.
+ */
+static bool
+AbsorbFsyncRequest(bool stop_at_current_cycle)
+{
+ static CheckpointerRequest req;
+ int fd = -1;
+#ifndef WIN32
+ int ret;
+#else
+ DWORD bytes_read;
+#endif
+
+ ReleaseLruFiles();
START_CRIT_SECTION();
+#ifndef WIN32
+ ret = pg_uds_recv_with_fd(fsync_fds[FSYNC_FD_PROCESS],
+ &req,
+ sizeof(req),
+ &fd);
+ if (ret < 0 && (errno == EWOULDBLOCK || errno == EAGAIN))
+ {
+ END_CRIT_SECTION();
+ return false;
+ }
+ else if (ret < 0)
+ elog(ERROR, "recvmsg failed: %m");
+#else
+ if (!absorb_read_in_progress)
+ {
+ if (!ReadFile(fsyncPipe[FSYNC_FD_PROCESS],
+ &req,
+ sizeof(req),
+ &bytes_read,
+ &absorb_overlapped))
+ {
+ if (GetLastError() != ERROR_IO_PENDING)
+ {
+ _dosmaperr(GetLastError());
+ elog(ERROR, "can't begin read from fsync pipe: %m");
+ }
- CheckpointerShmem->num_requests = 0;
+ /*
+ * An asynchronous read has begun. We'll tell caller to call us
+ * back when the event indicates completion.
+ */
+ absorb_read_in_progress = &absorb_overlapped.hEvent;
+ END_CRIT_SECTION();
+ return false;
+ }
+ /* The read completed synchronously. 'req' is now populated. */
+ }
+ if (absorb_read_in_progress)
+ {
+ /* Completed yet? */
+ if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_PROCESS],
+ &absorb_overlapped,
+ &bytes_read,
+ false))
+ {
+ if (GetLastError() == ERROR_IO_INCOMPLETE)
+ {
+ /* Nope. Spurious event? Tell caller to wait some more. */
+ END_CRIT_SECTION();
+ return false;
+ }
+ _dosmaperr(GetLastError());
+ elog(ERROR, "can't complete from fsync pipe: %m");
+ }
+ /* The asynchronous read completed. 'req' is now populated. */
+ absorb_read_in_progress = NULL;
+ }
- LWLockRelease(CheckpointerCommLock);
+ /* Check message size. */
+ if (bytes_read != sizeof(req))
+ elog(ERROR, "unexpected short read on fsync pipe");
+#endif
- for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ if (req.contains_fd != (fd != -1))
+ {
+ elog(FATAL, "message should have fd associated, but doesn't");
+ }
+ RememberFsyncRequest(&req.tag, fd, req.open_seq);
END_CRIT_SECTION();
- if (requests)
- pfree(requests);
+ if (stop_at_current_cycle &&
+ req.ckpt_started == CheckpointerShmem->ckpt_started)
+ return false;
+
+ return true;
}
/*
@@ -1370,3 +1353,138 @@ FirstCallSinceLastCheckpoint(void)
return FirstCall;
}
+
+uint64
+GetCheckpointSyncCycle(void)
+{
+ return pg_atomic_read_u64(&CheckpointerShmem->ckpt_cycle);
+}
+
+uint64
+IncCheckpointSyncCycle(void)
+{
+ return pg_atomic_fetch_add_u64(&CheckpointerShmem->ckpt_cycle, 1);
+}
+
+void
+CountBackendWrite(void)
+{
+ pg_atomic_fetch_add_u32(&CheckpointerShmem->num_backend_writes, 1);
+}
+
+/*
+ * Send a message to the checkpointer's fsync socket (Unix) or pipe (Windows).
+ * This is essentially a blocking call (there is no CHECK_FOR_INTERRUPTS, and
+ * even if there were it'd be suppressed since callers hold a lock), except
+ * that we don't ignore postmaster death so we need an event loop.
+ *
+ * The code is rather different on Windows, because there we have to begin the
+ * write and then wait for it to complete, while on Unix we have to wait until
+ * we can do the write.
+ */
+static void
+SendFsyncRequest(CheckpointerRequest *request, int fd)
+{
+#ifndef WIN32
+ ssize_t ret;
+ int rc;
+
+ while (true)
+ {
+ ret = pg_uds_send_with_fd(fsync_fds[FSYNC_FD_SUBMIT],
+ request,
+ sizeof(*request),
+ request->contains_fd ? fd : -1);
+
+ if (ret >= 0)
+ {
+ /*
+ * We don't think short writes will ever happen in realistic
+ * implementations, but let's make sure that's true...
+ */
+ if (ret != sizeof(*request))
+ elog(FATAL, "unexpected short write to fsync request socket");
+ break;
+ }
+ else if (errno == EWOULDBLOCK || errno == EAGAIN
+#ifdef __darwin__
+ || errno == EMSGSIZE || errno == ENOBUFS
+#endif
+ )
+ {
+ /*
+ * Testing on macOS 10.13 showed occasional EMSGSIZE or
+ * ENOBUFS errors, which could be handled by retrying. Unless
+ * the problem also shows up on other systems, let's handle those
+ * only for that OS.
+ */
+
+ /* Blocked on write - wait for socket to become writeable */
+ rc = WaitLatchOrSocket(NULL,
+ WL_SOCKET_WRITEABLE | WL_POSTMASTER_DEATH,
+ fsync_fds[FSYNC_FD_SUBMIT], -1, 0);
+ if (rc & WL_POSTMASTER_DEATH)
+ exit(1);
+ }
+ else
+ ereport(FATAL, (errmsg("could not send fsync request: %m")));
+ }
+
+#else /* WIN32 */
+ {
+ OVERLAPPED overlapped = {0};
+ DWORD nwritten;
+ int rc;
+
+ overlapped.hEvent = CreateEvent(NULL, TRUE, TRUE, NULL);
+
+ if (!WriteFile(fsyncPipe[FSYNC_FD_SUBMIT],
+ request,
+ sizeof(*request),
+ &nwritten,
+ &overlapped))
+ {
+ WaitEventSet *wes;
+ WaitEvent event;
+
+ /* Handle unexpected errors. */
+ if (GetLastError() != ERROR_IO_PENDING)
+ {
+ _dosmaperr(GetLastError());
+ CloseHandle(overlapped.hEvent);
+ ereport(FATAL, (errmsg("could not send fsync request: %m")));
+ }
+
+ /* Wait for asynchronous IO to complete. */
+ wes = CreateWaitEventSet(TopMemoryContext, 3);
+ AddWaitEventToSet(wes, WL_POSTMASTER_DEATH, PGINVALID_SOCKET, NULL,
+ NULL);
+ AddWaitEventToSet(wes, WL_WIN32_HANDLE, PGINVALID_SOCKET, NULL,
+ &overlapped.hEvent);
+ for (;;)
+ {
+ rc = WaitEventSetWait(wes, -1, &event, 1, 0);
+ if (rc == 1 && event.events == WL_POSTMASTER_DEATH)
+ proc_exit(1);
+ if (rc == 1 && event.events == WL_WIN32_HANDLE)
+ {
+ if (!GetOverlappedResult(fsyncPipe[FSYNC_FD_SUBMIT], &overlapped,
+ &nwritten, FALSE))
+ {
+ _dosmaperr(GetLastError());
+ CloseHandle(overlapped.hEvent);
+ ereport(FATAL, (errmsg("could not get result of sending fsync request: %m")));
+ }
+ if (nwritten > 0)
+ break;
+ }
+ }
+ FreeWaitEventSet(wes);
+ }
+
+ CloseHandle(overlapped.hEvent);
+ if (nwritten != sizeof(*request))
+ elog(FATAL, "unexpected short write to fsync request pipe");
+ }
+#endif
+}
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index a33a1311829..77eec366512 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -70,6 +70,7 @@
#include <time.h>
#include <sys/wait.h>
#include <ctype.h>
+#include <sys/types.h>
#include <sys/stat.h>
#include <sys/socket.h>
#include <fcntl.h>
@@ -435,6 +436,7 @@ static pid_t StartChildProcess(AuxProcType type);
static void StartAutovacuumWorker(void);
static void MaybeStartWalReceiver(void);
static void InitPostmasterDeathWatchHandle(void);
+static void InitFsyncFdSocketPair(void);
/*
* Archiver is allowed to start up at the current postmaster state?
@@ -524,9 +526,11 @@ typedef struct
HANDLE PostmasterHandle;
HANDLE initial_signal_pipe;
HANDLE syslogPipe[2];
+ HANDLE fsyncPipe[2];
#else
int postmaster_alive_fds[2];
int syslogPipe[2];
+ int fsync_fds[2];
#endif
char my_exec_path[MAXPGPATH];
char pkglib_path[MAXPGPATH];
@@ -569,6 +573,12 @@ int postmaster_alive_fds[2] = {-1, -1};
HANDLE PostmasterHandle;
#endif
+#ifndef WIN32
+int fsync_fds[2] = {-1, -1};
+#else
+HANDLE fsyncPipe[2] = {0, 0};
+#endif
+
/*
* Postmaster main entry point
*/
@@ -1199,6 +1209,11 @@ PostmasterMain(int argc, char *argv[])
*/
InitPostmasterDeathWatchHandle();
+ /*
+ * Initialize the socket pair used to transport file descriptors.
+ */
+ InitFsyncFdSocketPair();
+
#ifdef WIN32
/*
@@ -6013,7 +6028,8 @@ extern pg_time_t first_syslogger_file_time;
#define write_inheritable_socket(dest, src, childpid) ((*(dest) = (src)), true)
#define read_inheritable_socket(dest, src) (*(dest) = *(src))
#else
-static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child);
+static bool write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE child,
+ bool close_source);
static bool write_inheritable_socket(InheritableSocket *dest, SOCKET src,
pid_t childPid);
static void read_inheritable_socket(SOCKET *dest, InheritableSocket *src);
@@ -6077,11 +6093,20 @@ save_backend_variables(BackendParameters *param, Port *port,
param->PostmasterHandle = PostmasterHandle;
if (!write_duplicated_handle(¶m->initial_signal_pipe,
pgwin32_create_signal_listener(childPid),
- childProcess))
+ childProcess, true))
+ return false;
+ if (!write_duplicated_handle(¶m->fsyncPipe[0],
+ fsyncPipe[0],
+ childProcess, false))
+ return false;
+ if (!write_duplicated_handle(¶m->fsyncPipe[1],
+ fsyncPipe[1],
+ childProcess, false))
return false;
#else
memcpy(¶m->postmaster_alive_fds, &postmaster_alive_fds,
sizeof(postmaster_alive_fds));
+ memcpy(¶m->fsync_fds, &fsync_fds, sizeof(fsync_fds));
#endif
memcpy(¶m->syslogPipe, &syslogPipe, sizeof(syslogPipe));
@@ -6102,7 +6127,8 @@ save_backend_variables(BackendParameters *param, Port *port,
* process instance of the handle to the parameter file.
*/
static bool
-write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
+write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess,
+ bool close_source)
{
HANDLE hChild = INVALID_HANDLE_VALUE;
@@ -6112,7 +6138,8 @@ write_duplicated_handle(HANDLE *dest, HANDLE src, HANDLE childProcess)
&hChild,
0,
TRUE,
- DUPLICATE_CLOSE_SOURCE | DUPLICATE_SAME_ACCESS))
+ (close_source ? DUPLICATE_CLOSE_SOURCE : 0) |
+ DUPLICATE_SAME_ACCESS))
{
ereport(LOG,
(errmsg_internal("could not duplicate handle to be written to backend parameter file: error code %lu",
@@ -6308,9 +6335,12 @@ restore_backend_variables(BackendParameters *param, Port *port)
#ifdef WIN32
PostmasterHandle = param->PostmasterHandle;
pgwin32_initial_signal_pipe = param->initial_signal_pipe;
+ fsyncPipe[0] = param->fsyncPipe[0];
+ fsyncPipe[1] = param->fsyncPipe[1];
#else
memcpy(&postmaster_alive_fds, ¶m->postmaster_alive_fds,
sizeof(postmaster_alive_fds));
+ memcpy(&fsync_fds, ¶m->fsync_fds, sizeof(fsync_fds));
#endif
memcpy(&syslogPipe, ¶m->syslogPipe, sizeof(syslogPipe));
@@ -6487,3 +6517,88 @@ InitPostmasterDeathWatchHandle(void)
GetLastError())));
#endif /* WIN32 */
}
+
+/* Create socket used for requesting fsyncs by checkpointer */
+static void
+InitFsyncFdSocketPair(void)
+{
+ Assert(MyProcPid == PostmasterPid);
+
+#ifndef WIN32
+ if (socketpair(AF_UNIX, SOCK_STREAM, 0, fsync_fds) < 0)
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create fsync sockets: %m")));
+ /*
+ * Set O_NONBLOCK on both fds.
+ */
+ if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFL, O_NONBLOCK) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync process socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+ if (fcntl(fsync_fds[FSYNC_FD_PROCESS], F_SETFD, FD_CLOEXEC) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync process socket to close-on-exec mode: %m")));
+#endif
+
+ if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFL, O_NONBLOCK) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync submit socket to nonblocking mode: %m")));
+#ifndef EXEC_BACKEND
+ if (fcntl(fsync_fds[FSYNC_FD_SUBMIT], F_SETFD, FD_CLOEXEC) == -1)
+ ereport(FATAL,
+ (errcode_for_socket_access(),
+ errmsg_internal("could not set fsync submit socket to close-on-exec mode: %m")));
+#endif
+#else
+ {
+ char pipename[MAX_PATH];
+ SECURITY_ATTRIBUTES sa;
+
+ memset(&sa, 0, sizeof(sa));
+
+ /*
+ * We'll create a named pipe, because anonymous pipes don't allow
+ * overlapped (= async) IO or message-oriented communication. We'll
+ * open both ends of it here, and then duplicate them into all child
+ * processes in save_backend_variables(). First, open the server end.
+ */
+ snprintf(pipename, sizeof(pipename), "\\\\.\\Pipe\\fsync_pipe.%08x",
+ GetCurrentProcessId());
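+ /* One pipe instance with 4kB buffers; the PID makes the name unique. */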
+ fsyncPipe[FSYNC_FD_PROCESS] = CreateNamedPipeA(pipename,
+ PIPE_ACCESS_INBOUND | FILE_FLAG_OVERLAPPED,
+ PIPE_TYPE_MESSAGE | PIPE_WAIT,
+ 1,
+ 4096,
+ 4096,
+ -1,
+ &sa);
+ if (fsyncPipe[FSYNC_FD_PROCESS] == INVALID_HANDLE_VALUE)
+ {
+ _dosmaperr(GetLastError());
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create server end of fsync pipe: %m")));
+ }
+
+ /* Now open the client end. */
+ fsyncPipe[FSYNC_FD_SUBMIT] = CreateFileA(pipename,
+ GENERIC_WRITE,
+ 0,
+ &sa,
+ OPEN_EXISTING,
+ FILE_ATTRIBUTE_NORMAL | FILE_FLAG_OVERLAPPED,
+ NULL);
+ if (fsyncPipe[FSYNC_FD_SUBMIT] == INVALID_HANDLE_VALUE)
+ {
+ _dosmaperr(GetLastError());
+ ereport(FATAL,
+ (errcode_for_file_access(),
+ errmsg_internal("could not create client end of fsync pipe: %m")));
+ }
+ }
+#endif
+}
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe57063..256cc5e0217 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9e596e7868b..3b6451554a7 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -85,6 +85,7 @@
#include "catalog/pg_tablespace.h"
#include "common/file_perm.h"
#include "pgstat.h"
+#include "port/atomics.h"
#include "portability/mem.h"
#include "storage/fd.h"
#include "storage/ipc.h"
@@ -173,6 +174,7 @@ bool data_sync_retry = false;
#define FD_DELETE_AT_CLOSE (1 << 0) /* T = delete when closed */
#define FD_CLOSE_AT_EOXACT (1 << 1) /* T = close at eoXact */
#define FD_TEMP_FILE_LIMIT (1 << 2) /* T = respect temp_file_limit */
+#define FD_NOT_IN_LRU (1 << 3) /* T = not in LRU */
typedef struct vfd
{
@@ -187,6 +189,7 @@ typedef struct vfd
/* NB: fileName is malloc'd, and must be free'd when closing the VFD */
int fileFlags; /* open(2) flags for (re)opening the file */
mode_t fileMode; /* mode to pass to open(2) */
+ uint64 open_seq; /* sequence number of opened file */
} Vfd;
/*
@@ -296,7 +299,6 @@ static void LruDelete(File file);
static void Insert(File file);
static int LruInsert(File file);
static bool ReleaseLruFile(void);
-static void ReleaseLruFiles(void);
static File AllocateVfd(void);
static void FreeVfd(File file);
@@ -325,6 +327,13 @@ static void unlink_if_exists_fname(const char *fname, bool isdir, int elevel);
static int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
static int fsync_parent_path(const char *fname, int elevel);
+/* Shared memory state: a cluster-wide counter for ordering file opens. */
+typedef struct
+{
+ pg_atomic_uint64 open_seq;
+} FdSharedData;
+
+static FdSharedData *fd_shared;
/*
* pg_fsync --- do fsync with or without writethrough
@@ -777,6 +786,20 @@ InitFileAccess(void)
on_proc_exit(AtProcExit_Files, 0);
}
+/*
+ * Initialize shared memory state. This is called after shared memory is
+ * ready.
+ */
+void
+FileShmemInit(void)
+{
+ bool found;
+
+ fd_shared = ShmemInitStruct("fd_shared", sizeof(*fd_shared), &found);
+ if (!found)
+ pg_atomic_init_u64(&fd_shared->open_seq, 0);
+}
+
/*
* count_usable_fds --- count how many FDs the system will let us open,
* and estimate how many are already open.
@@ -1086,6 +1109,9 @@ LruInsert(File file)
{
++nfile;
}
+
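+ /*
+ * The file was just (re)opened, so capture a fresh open sequence
+ * number. The checkpointer uses these numbers to keep the oldest
+ * descriptor for each file, so kernel write-back errors can't be lost.
+ */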
+ vfdP->open_seq =
+ pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
}
/*
@@ -1122,7 +1148,7 @@ ReleaseLruFile(void)
* Release kernel FDs as needed to get under the max_safe_fds limit.
* After calling this, it's OK to try to open another file.
*/
-static void
+void
ReleaseLruFiles(void)
{
while (nfile + numAllocatedDescs >= max_safe_fds)
@@ -1235,9 +1261,11 @@ FileAccess(File file)
* We now know that the file is open and that it is not the last one
* accessed, so we need to move it to the head of the Lru ring.
*/
-
- Delete(file);
- Insert(file);
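+ /* Descriptors adopted via FileOpenForFd() are not in the LRU ring. */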
+ if (!(VfdCache[file].fdstate & FD_NOT_IN_LRU))
+ {
+ Delete(file);
+ Insert(file);
+ }
}
return 0;
@@ -1355,6 +1383,57 @@ PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode)
vfdP->fileSize = 0;
vfdP->fdstate = 0x0;
vfdP->resowner = NULL;
+ vfdP->open_seq = pg_atomic_fetch_add_u64(&fd_shared->open_seq, 1);
+
+ return file;
+}
+
+/*
+ * Open a File for a pre-existing file descriptor.
+ *
+ * Note that these files will not be closed on an LRU basis, so the caller
+ * is responsible for limiting the number of open file descriptors.
+ *
+ * The passed-in name is purely for informational purposes.
+ */
+File
+FileOpenForFd(int fd, const char *fileName, uint64 open_seq)
+{
+ char *fnamecopy;
+ File file;
+ Vfd *vfdP;
+
+ /*
+ * We need a malloc'd copy of the file name; fail cleanly if no room.
+ */
+ fnamecopy = strdup(fileName);
+ if (fnamecopy == NULL)
+ ereport(ERROR,
+ (errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory")));
+
+ file = AllocateVfd();
+ vfdP = &VfdCache[file];
+
+ /* Close excess kernel FDs. */
+ ReleaseLruFiles();
+
+ vfdP->fd = fd;
+ ++nfile;
+
+ DO_DB(elog(LOG, "FileOpenForFd: success %d/%d (%s)",
+ file, fd, fnamecopy));
+
+ /* NB: Explicitly not inserted into LRU! */
+
+ vfdP->fileName = fnamecopy;
+ /* An adopted descriptor is never reopened, so no flags/mode to save. */
+ vfdP->fileFlags = 0;
+ vfdP->fileMode = 0;
+ vfdP->fileSize = 0;
+ vfdP->fdstate = FD_NOT_IN_LRU;
+ vfdP->resowner = NULL;
+ vfdP->open_seq = open_seq;
return file;
}
@@ -1712,7 +1791,11 @@ FileClose(File file)
vfdP->fd = VFD_CLOSED;
/* remove the file from the lru ring */
- Delete(file);
+ if (!(vfdP->fdstate & FD_NOT_IN_LRU))
+ Delete(file);
}
if (vfdP->fdstate & FD_TEMP_FILE_LIMIT)
@@ -2081,6 +2164,10 @@ int
FileGetRawDesc(File file)
{
Assert(FileIsValid(file));
+
+ if (FileAccess(file))
+ return -1;
+
return VfdCache[file].fd;
}
@@ -2104,6 +2191,17 @@ FileGetRawMode(File file)
return VfdCache[file].fileMode;
}
+/*
+ * Get the opening sequence number of this file. This number is captured
+ * after the file was opened but before anything was written to it.
+ */
+uint64
+FileGetOpenSeq(File file)
+{
+ Assert(FileIsValid(file));
+ return VfdCache[file].open_seq;
+}
+
/*
* Make room for another allocatedDescs[] array entry if needed and possible.
* Returns true if an array element is available.
@@ -3454,3 +3552,110 @@ data_sync_elevel(int elevel)
{
return data_sync_retry ? elevel : PANIC;
}
+
+#ifndef WIN32
+
+/*
+ * Send data over a unix domain socket, optionally (when fd >= 0) including a
+ * file descriptor.
+ */
+ssize_t
+pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd)
+{
+ ssize_t size;
+ struct msghdr msg = {0};
+ struct iovec iov = {0};
+ /* cmsg header, union for correct alignment */
+ union
+ {
+ struct cmsghdr cmsghdr;
+ char control[CMSG_SPACE(sizeof (int))];
+ } cmsgu;
+ struct cmsghdr *cmsg;
+
+ memset(&cmsgu, 0, sizeof(cmsgu));
+ iov.iov_base = buf;
+ iov.iov_len = buflen;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+
+ if (fd >= 0)
+ {
+ msg.msg_control = cmsgu.control;
+ msg.msg_controllen = sizeof(cmsgu.control);
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ cmsg->cmsg_len = CMSG_LEN(sizeof (int));
+ cmsg->cmsg_level = SOL_SOCKET;
+ cmsg->cmsg_type = SCM_RIGHTS;
+
+ *((int *) CMSG_DATA(cmsg)) = fd;
+ }
+
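+ /*
+ * Sending SCM_RIGHTS ancillary data makes the kernel install a
+ * duplicate of "fd" in the receiving process at recvmsg() time.
+ */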
+ size = sendmsg(sock, &msg, 0);
+
+ /* errors are returned directly */
+ return size;
+}
+
+/*
+ * Receive data from a unix domain socket. If a file descriptor was passed
+ * over the socket, store it in *fd; otherwise set *fd to -1.
+ */
+ssize_t
+pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd)
+{
+ ssize_t size;
+ struct msghdr msg;
+ struct iovec iov;
+ /* cmsg header, union for correct alignment */
+ union
+ {
+ struct cmsghdr cmsghdr;
+ char control[CMSG_SPACE(sizeof (int))];
+ } cmsgu;
+ struct cmsghdr *cmsg;
+
+ Assert(fd != NULL);
+
+ iov.iov_base = buf;
+ iov.iov_len = bufsize;
+
+ msg.msg_name = NULL;
+ msg.msg_namelen = 0;
+ msg.msg_iov = &iov;
+ msg.msg_iovlen = 1;
+ msg.msg_control = cmsgu.control;
+ msg.msg_controllen = sizeof(cmsgu.control);
+
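+ /*
+ * The control buffer gives recvmsg() somewhere to deliver a descriptor
+ * passed with SCM_RIGHTS; without it, any passed descriptor would be
+ * discarded and the message flagged with MSG_CTRUNC.
+ */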
+ size = recvmsg(sock, &msg, 0);
+
+ if (size < 0)
+ {
+ *fd = -1;
+ return size;
+ }
+
+ cmsg = CMSG_FIRSTHDR(&msg);
+ if (cmsg && cmsg->cmsg_len == CMSG_LEN(sizeof(int)))
+ {
+ if (cmsg->cmsg_level != SOL_SOCKET)
+ elog(FATAL, "unexpected cmsg_level");
+
+ if (cmsg->cmsg_type != SCM_RIGHTS)
+ elog(FATAL, "unexpected cmsg_type");
+
+ *fd = *((int *) CMSG_DATA(cmsg));
+
+ /* FIXME: check / handle additional cmsg structures */
+ }
+ else
+ *fd = -1;
+
+ return size;
+}
+
+#endif
diff --git a/src/backend/storage/freespace/freespace.c b/src/backend/storage/freespace/freespace.c
index 7c4ad1c4494..2b47824aab9 100644
--- a/src/backend/storage/freespace/freespace.c
+++ b/src/backend/storage/freespace/freespace.c
@@ -556,7 +556,7 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
* not on extension.)
*/
if (rel->rd_smgr->smgr_fsm_nblocks == InvalidBlockNumber ||
- blkno >= rel->rd_smgr->smgr_fsm_nblocks)
+ rel->rd_smgr->smgr_fsm_nblocks == 0)
{
if (smgrexists(rel->rd_smgr, FSM_FORKNUM))
rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
@@ -564,6 +564,9 @@ fsm_readbuf(Relation rel, FSMAddress addr, bool extend)
else
rel->rd_smgr->smgr_fsm_nblocks = 0;
}
+ else if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
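+ /* The cached size may be stale; refresh it before concluding EOF. */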
+ rel->rd_smgr->smgr_fsm_nblocks = smgrnblocks(rel->rd_smgr,
+ FSM_FORKNUM);
/* Handle requests beyond EOF */
if (blkno >= rel->rd_smgr->smgr_fsm_nblocks)
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c03..efbd25b84da 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
@@ -270,6 +271,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
SyncScanShmemInit();
AsyncShmemInit();
BackendRandomShmemInit();
+ FileShmemInit();
#ifdef EXEC_BACKEND
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index c129446f9c9..d4f3ad0d44d 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -878,6 +878,12 @@ WaitEventAdjustWin32(WaitEventSet *set, WaitEvent *event)
{
*handle = PostmasterHandle;
}
+#ifdef WIN32
+ else if (event->events == WL_WIN32_HANDLE)
+ {
+ *handle = *(HANDLE *)event->user_data;
+ }
+#endif
else
{
int flags = FD_CLOSE; /* always check for errors/EOF */
@@ -1453,6 +1459,12 @@ WaitEventSetWaitBlock(WaitEventSet *set, int cur_timeout,
returned_events++;
}
}
+ else if (cur_event->events & WL_WIN32_HANDLE)
+ {
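+ /* A caller-supplied Win32 handle (e.g. an overlapped IO event) fired. */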
+ occurred_events->events |= WL_WIN32_HANDLE;
+ occurred_events++;
+ returned_events++;
+ }
return returned_events;
}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..c9c4be325ed 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4c6a50509f8..344e0e12d6f 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -110,6 +97,7 @@ typedef struct _MdfdVec
{
File mdfd_vfd; /* fd number in fd.c's pool */
BlockNumber mdfd_segno; /* segment number, from 0 */
+ uint64 mdfd_dirtied_cycle;
} MdfdVec;
static MemoryContext MdCxt; /* context for all MdfdVec objects */
@@ -134,30 +122,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +151,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +174,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -334,6 +242,7 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
mdfd = &reln->md_seg_fds[forkNum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
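+ /*
+ * Start one cycle behind the current sync cycle, so that the first
+ * dirtying of this segment forwards an fsync request (and, where
+ * supported, its file descriptor) to the checkpointer.
+ */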
+ mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
}
/*
@@ -388,7 +297,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -448,7 +357,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -540,7 +449,16 @@ mdextend(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
if (!skipFsync && !SmgrIsTemp(reln))
- register_dirty_segment(reln, forknum, v);
+ {
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = v->mdfd_segno;
+ v->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ v->mdfd_vfd,
+ v->mdfd_dirtied_cycle);
+ }
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
}
@@ -600,6 +518,7 @@ mdopen(SMgrRelation reln, ForkNumber forknum, int behavior)
mdfd = &reln->md_seg_fds[forknum][0];
mdfd->mdfd_vfd = fd;
mdfd->mdfd_segno = 0;
+ mdfd->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
Assert(_mdnblocks(reln, forknum, mdfd) <= ((BlockNumber) RELSEG_SIZE));
@@ -831,7 +750,16 @@ mdwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
}
if (!skipFsync && !SmgrIsTemp(reln))
- register_dirty_segment(reln, forknum, v);
+ {
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = v->mdfd_segno;
+ v->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ v->mdfd_vfd,
+ v->mdfd_dirtied_cycle);
+ }
}
/*
@@ -1021,673 +949,38 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
+ * Compute the filename for the specified segment of the relation, and
+ * write it into the caller-supplied buffer of size MAXPGPATH.
*/
void
-mdpostckpt(void)
+mdpath(const SmgrFileTag *tag, char *out)
{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
+ char *path;
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
+ path = relpathperm(tag->node, tag->forknum);
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
+ if (tag->segno > 0)
+ snprintf(out, MAXPGPATH, "%s.%u", path, tag->segno);
+ else
+ snprintf(out, MAXPGPATH, "%s", path);
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ pfree(path);
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
- /* Temp relations should never be fsync'd */
- Assert(!SmgrIsTemp(reln));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
- ereport(DEBUG1,
- (errmsg("could not forward fsync request because request queue is full")));
-
- if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- FilePathName(seg->mdfd_vfd))));
- }
-}
-
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ SmgrFileTag tag;
+
+ tag.node = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
+ seg->mdfd_dirtied_cycle = FsyncAtCheckpoint(&tag,
+ seg->mdfd_vfd,
+ seg->mdfd_dirtied_cycle);
}
/*
@@ -1817,6 +1110,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
v = &reln->md_seg_fds[forknum][segno];
v->mdfd_vfd = fd;
v->mdfd_segno = segno;
+ v->mdfd_dirtied_cycle = GetCheckpointSyncCycle() - 1;
Assert(_mdnblocks(reln, forknum, v) <= ((BlockNumber) RELSEG_SIZE));
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342ef86a..c36ba4298b7 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -59,9 +60,7 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ void (*smgr_path) (const SmgrFileTag *tag, char *out);
} f_smgr;
@@ -82,9 +81,7 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_path = mdpath
}
};
@@ -104,6 +101,15 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation. If more are added, we'll need to
+ * be able to dispatch based on a file tag.
+ */
+static inline int
+which_for_file_tag(const SmgrFileTag *tag)
+{
+ return 0;
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +124,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -751,50 +759,13 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
/*
- * smgrsync() -- Sync files to disk during checkpoint.
+ * smgrpath() -- Expand a tag to a path.
*/
void
-smgrsync(void)
+smgrpath(const SmgrFileTag *tag, char *out)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ smgrsw[which_for_file_tag(tag)].smgr_path(tag, out);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 00000000000..f3aaeff134f
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,803 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and in the startup process.
+ *
+ * Originally this logic lived inside md.c, but it has been generalized
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+
+/*
+ * Special values for the segno member of SmgrFileTag.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
+#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
+#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ SmgrFileTag tag;
+ File file;
+ uint64 cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ uint64 cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
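+/*
+ * Number of file descriptors currently held open by entries in the pending
+ * fsync table. syncpass() closes them as it processes entries, and request
+ * absorption is throttled when this count approaches max_safe_fds.
+ */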
+static uint32 open_fsync_queue_files = 0;
+static bool sync_in_progress = false;
+static uint64 ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static void syncpass(bool include_current);
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(SmgrFileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbAllFsyncRequests();
+
+ syncpass(false);
+}
+
+/*
+ * Do one pass over the fsync request hashtable and perform the necessary
+ * fsyncs. Increments the sync cycle counter.
+ *
+ * If include_current is true, perform all fsyncs (this is done when too
+ * many files are open); otherwise, perform only the fsyncs belonging to
+ * the cycle that was current at call time.
+ */
+static void
+syncpass(bool include_current)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use GetCheckpointSyncCycle() to tell old entries apart
+ * from new ones: new ones will have cycle_ctr equal to
+ * IncCheckpointSyncCycle().
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ entry->cycle_ctr = GetCheckpointSyncCycle();
+ }
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ IncCheckpointSyncCycle();
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ /*
+ * If we are processing fsync requests because too many file handles are
+ * open, close files regardless of their cycle. Otherwise we might not
+ * find anything to close, and we want to make room as quickly as
+ * possible so that more requests can be absorbed.
+ */
+ if (!include_current)
+ {
+ /* If the entry is new then don't process it this time. */
+ if (entry->cycle_ctr == GetCheckpointSyncCycle())
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((entry->cycle_ctr + 1) == GetCheckpointSyncCycle());
+ }
+
+ /*
+ * If fsync is off then we don't have to bother opening the file at
+ * all. (We delay checking until this point so that changing fsync on
+ * the fly behaves sensibly.)
+ *
+ * XXX: Why is that an important goal? Doesn't give any interesting
+ * guarantees afaict?
+ */
+ if (enableFsync)
+ {
+ File file;
+
+ /*
+ * The fsync table could contain requests to fsync segments that
+ * have been deleted (unlinked) by the time we get to them. That
+ * used to be problematic, but now we have a filehandle to the
+ * deleted file. That means we might fsync an empty file
+ * superfluously, in a relatively tight window, which is
+ * acceptable.
+ */
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ if (entry->file == -1)
+ {
+ /*
+ * If we aren't transferring file descriptors directly to the
+ * checkpointer on this platform, we'll have to convert the
+ * tag to the path and open it (and close it again below).
+ */
+ char path[MAXPGPATH];
+
+ smgrpath(&entry->tag, path);
+ file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ if (file < 0)
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\" to fsync: %m",
+ path)));
+ }
+ else
+ {
+ /*
+ * Otherwise, we have kept the file descriptor from the oldest
+ * request for the same tag.
+ */
+ file = entry->file;
+ }
+
+ if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ FilePathName(file))));
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ FilePathName(file),
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+
+ if (entry->file == -1)
+ FileClose(file);
+ }
+
+ if (entry->file >= 0)
+ {
+ /*
+ * Close file. XXX: centralize code.
+ */
+ Assert(open_fsync_queue_files > 0);
+ open_fsync_queue_files--;
+ FileClose(entry->file);
+ entry->file = -1;
+ }
+
+ /* Remove the entry. */
+ if (hash_search(pendingFsyncTable, &entry->tag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingFsyncTable corrupted");
+
+ /*
+ * If in checkpointer, we want to absorb pending requests every so
+ * often to prevent overflow of the fsync request queue. It is
+ * unspecified whether newly-added entries will be visited by
+ * hash_seq_search, but we don't care since we don't need to
+ * process them anyway.
+ */
+ if (absorb_counter-- <= 0)
+ {
+ /*
+ * Don't absorb if too many files are open. This pass will
+ * soon close some, so check again later.
+ */
+ if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+ } /* end loop over hashtable entries */
+
+ /* Flag successful completion of syncpass */
+ sync_in_progress = false;
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ /* XXX: Centralize this condition */
+ if (open_fsync_queue_files < ((max_safe_fds * 7) / 10))
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * FsyncAtCheckpoint() -- Mark a relation segment as needing fsync
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request
+ * to the checkpointer process.
+ */
+uint64
+FsyncAtCheckpoint(const SmgrFileTag *tag, File file, uint64 last_cycle)
+{
+ uint64 cycle;
+
+ pg_memory_barrier();
+ cycle = GetCheckpointSyncCycle();
+
+ /*
+ * For historical reasons the checkpointer keeps track of the number of
+ * times backends perform writes themselves.
+ */
+ if (!AmBackgroundWriterProcess())
+ CountBackendWrite();
+
+ /* Don't repeatedly register the same segment as dirty. */
+ if (last_cycle == cycle)
+ return cycle;
+
+ if (pendingFsyncTable)
+ {
+ int fd;
+
+ /*
+ * Push it into the local pending-ops table.
+ *
+ * We must duplicate the fd: we can't have fd.c close it behind our
+ * back, as that would forfeit the error reporting guarantees on
+ * Linux. RememberFsyncRequest() will manage the duplicate's lifetime.
+ */
+ ReleaseLruFiles();
+ fd = dup(FileGetRawDesc(file));
+ if (fd < 0)
+ elog(ERROR, "couldn't dup: %m");
+ RememberFsyncRequest(tag, fd, FileGetOpenSeq(file));
+ }
+ else
+ ForwardFsyncRequest(tag, file);
+
+ return cycle;
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ SmgrFileTag tag;
+
+ tag.node = rnode.node;
+ tag.forknum = MAIN_FORKNUM;
+ tag.segno = UNLINK_RELATION_REQUEST;
+
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+ ForwardFsyncRequest(&tag, -1);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * The range of possible segment numbers is way less than the range of
+ * BlockNumber, so we can reserve high values of segno for special purposes.
+ * We define three:
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers don't exceed INT_MAX.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(const SmgrFileTag *tag, int fd, uint64 open_seq)
+{
+ Assert(pendingFsyncTable);
+
+ if (tag->segno == FORGET_RELATION_FSYNC ||
+ tag->segno == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if ((tag->segno == FORGET_RELATION_FSYNC &&
+ tag->node.dbNode == entry->tag.node.dbNode &&
+ tag->node.relNode == entry->tag.node.relNode &&
+ (tag->forknum == InvalidForkNumber ||
+ tag->forknum == entry->tag.forknum)) ||
+ (tag->segno == FORGET_DATABASE_FSYNC &&
+ tag->node.dbNode == entry->tag.node.dbNode))
+ {
+ if (entry->file != -1)
+ {
+ Assert(open_fsync_queue_files > 0);
+ open_fsync_queue_files--;
+ FileClose(entry->file);
+ }
+ hash_search(pendingFsyncTable, &entry->tag, HASH_REMOVE, NULL);
+ }
+ }
+
+ /* Remove unlink requests */
+ if (tag->segno == FORGET_DATABASE_FSYNC)
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (tag->node.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (tag->segno == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(tag->forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = tag->node;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ tag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->file = -1;
+ entry->cycle_ctr = GetCheckpointSyncCycle();
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ if (fd >= 0)
+ {
+ File existing_file;
+ File new_file;
+
+ /*
+ * If we didn't have a file already, or we did have a file but it
+ * was opened later than this one, we'll keep the newly arrived
+ * one.
+ */
+ existing_file = entry->file;
+ if (existing_file == -1 ||
+ FileGetOpenSeq(existing_file) > open_seq)
+ {
+ char path[MAXPGPATH];
+
+ smgrpath(tag, path);
+
+ new_file = FileOpenForFd(fd, path, open_seq);
+ if (new_file < 0)
+ elog(ERROR, "cannot open file");
+ /* caller must have reserved entry */
+ entry->file = new_file;
+
+ if (existing_file != -1)
+ FileClose(existing_file);
+ else
+ open_fsync_queue_files++;
+ }
+ else
+ {
+ /*
+ * The file is already open. We have to keep the older fd, since
+ * errors might be reported only to it, so close the one we just
+ * received.
+ *
+ * XXX: check for errors.
+ */
+ close(fd);
+ }
+
+ FlushFsyncRequestQueueIfNecessary();
+ }
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * Flush the fsync request queue enough to make sure there's room for at least
+ * one more entry.
+ */
+bool
+FlushFsyncRequestQueueIfNecessary(void)
+{
+ if (sync_in_progress)
+ return false;
+
+ while (true)
+ {
+ if (open_fsync_queue_files >= ((max_safe_fds * 7) / 10))
+ {
+ elog(DEBUG1,
+ "flush fsync request queue due to %u open files",
+ open_fsync_queue_files);
+ syncpass(true);
+ elog(DEBUG1,
+ "flushed fsync request, now at %u open files",
+ open_fsync_queue_files);
+ }
+ else
+ break;
+ }
+
+ return true;
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ SmgrFileTag tag;
+
+ /* Create a special "forget relation" tag. */
+ tag.node = rnode;
+ tag.forknum = forknum;
+ tag.segno = FORGET_RELATION_FSYNC;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ ForwardFsyncRequest(&tag, -1);
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ SmgrFileTag tag;
+
+ /* Create a special "forget database" tag. */
+ tag.node.dbNode = dbid;
+ tag.node.spcNode = 0;
+ tag.node.relNode = 0;
+ tag.forknum = InvalidForkNumber;
+ tag.segno = FORGET_DATABASE_FSYNC;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(&tag, -1, 0);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ ForwardFsyncRequest(&tag, -1);
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 970c94ee805..32bc91102d7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 0ec3ff0fd6b..19def5c8b16 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..137c748dfaf 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 00000000000..252a94f2909
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "storage/smgr.h"
+#include "storage/smgrsync.h"
+
+/*
+ * Control whether we transfer file descriptors to the checkpointer, to
+ * preserve error state on certain kernels. We don't yet have support for
+ * sending files on Windows (it's entirely possible but it's not clear whether
+ * it would actually be useful for anything on that platform). The macro is
+ * here just so that it can be commented out to test the non-fd-passing code
+ * path on Unix systems.
+ */
+#ifndef WIN32
+#define CHECKPOINTER_TRANSFER_FILES
+#endif
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+/* The type used for counting checkpoint cycles. */
+typedef uint32 CheckpointCycle;
+
+/*
+ * A tag identifying a file to be flushed by the checkpointer. This is
+ * convertible to the file's path, but it's convenient to have a small,
+ * fixed-size object to use as a hash table key.
+ */
+typedef struct DirtyFileTag
+{
+ RelFileNode node;
+ ForkNumber forknum;
+ int segno;
+} DirtyFileTag;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern CheckpointCycle register_dirty_file(const DirtyFileTag *tag,
+ File file,
+ CheckpointCycle last_cycle);
+
+extern void ForwardFsyncRequest(const SmgrFileTag *tag, File fd);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern uint64 GetCheckpointSyncCycle(void);
+extern uint64 IncCheckpointSyncCycle(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif
diff --git a/src/include/postmaster/postmaster.h b/src/include/postmaster/postmaster.h
index a40d66e8906..8e3bc6edf90 100644
--- a/src/include/postmaster/postmaster.h
+++ b/src/include/postmaster/postmaster.h
@@ -44,6 +44,15 @@ extern int postmaster_alive_fds[2];
#define POSTMASTER_FD_OWN 1 /* kept open by postmaster only */
#endif
+#define FSYNC_FD_SUBMIT 0
+#define FSYNC_FD_PROCESS 1
+
+#ifndef WIN32
+extern int fsync_fds[2];
+#else
+extern HANDLE fsyncPipe[2];
+#endif
+
extern PGDLLIMPORT const char *progname;
extern void PostmasterMain(int argc, char *argv[]) pg_attribute_noreturn();
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index cb882fb74e5..5fe7d90772d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -62,6 +62,7 @@ extern int max_safe_fds;
/* Operations on virtual Files --- equivalent to Unix kernel file ops */
extern File PathNameOpenFile(const char *fileName, int fileFlags);
extern File PathNameOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
+extern File FileOpenForFd(int fd, const char *fileName, uint64 open_seq);
extern File OpenTemporaryFile(bool interXact);
extern void FileClose(File file);
extern int FilePrefetch(File file, off_t offset, int amount, uint32 wait_event_info);
@@ -75,6 +76,8 @@ extern char *FilePathName(File file);
extern int FileGetRawDesc(File file);
extern int FileGetRawFlags(File file);
extern mode_t FileGetRawMode(File file);
+extern uint64 FileGetOpenSeq(File file);
+extern void FileSetOpenSeq(File file, uint64 seq);
/* Operations used for sharing named temporary files */
extern File PathNameCreateTemporaryFile(const char *name, bool error_on_failure);
@@ -113,6 +116,7 @@ extern int MakePGDirectory(const char *directoryName);
/* Miscellaneous support routines */
extern void InitFileAccess(void);
+extern void FileShmemInit(void);
extern void set_max_safe_fds(void);
extern void closeAllVfds(void);
extern void SetTempTablespaces(Oid *tableSpaces, int numSpaces);
@@ -124,6 +128,7 @@ extern void AtEOSubXact_Files(bool isCommit, SubTransactionId mySubid,
SubTransactionId parentSubid);
extern void RemovePgTempFiles(void);
extern bool looks_like_temp_rel_name(const char *name);
+extern void ReleaseLruFiles(void);
extern int pg_fsync(int fd);
extern int pg_fsync_no_writethrough(int fd);
@@ -141,4 +146,10 @@ extern int data_sync_elevel(int elevel);
#define PG_TEMP_FILES_DIR "pgsql_tmp"
#define PG_TEMP_FILE_PREFIX "pgsql_tmp"
+#ifndef WIN32
+/* XXX: This should probably go elsewhere */
+ssize_t pg_uds_send_with_fd(int sock, void *buf, ssize_t buflen, int fd);
+ssize_t pg_uds_recv_with_fd(int sock, void *buf, ssize_t bufsize, int *fd);
+#endif
+
#endif /* FD_H */
diff --git a/src/include/storage/latch.h b/src/include/storage/latch.h
index fd8735b7f5f..a74eedfe4e9 100644
--- a/src/include/storage/latch.h
+++ b/src/include/storage/latch.h
@@ -128,6 +128,7 @@ typedef struct Latch
#define WL_POSTMASTER_DEATH (1 << 4)
#ifdef WIN32
#define WL_SOCKET_CONNECTED (1 << 5)
+#define WL_WIN32_HANDLE (1 << 6)
#else
/* avoid having to deal with case on platforms not requiring it */
#define WL_SOCKET_CONNECTED WL_SOCKET_WRITEABLE
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc9692..dc22efbe0a8 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -77,6 +77,18 @@ typedef struct SMgrRelationData
typedef SMgrRelationData *SMgrRelation;
+/*
+ * A tag identifying a file to be flushed at the next checkpoint. This is
+ * convertible to the file's path, but it's convenient to have a small,
+ * fixed-size object to use as a hash table key.
+ */
+typedef struct SmgrFileTag
+{
+ RelFileNode node;
+ ForkNumber forknum;
+ int segno;
+} SmgrFileTag;
+
#define SmgrIsTemp(smgr) \
RelFileNodeBackendIsTemp((smgr)->smgr_rnode)
@@ -106,9 +118,7 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern void smgrpath(const SmgrFileTag *tag, char *out);
extern void AtEOXact_SMgr(void);
@@ -134,13 +144,9 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
+extern void mdpath(const SmgrFileTag *tag, char *out);
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
+extern bool FlushFsyncRequestQueueIfNecessary(void);
extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
extern void ForgetDatabaseFsyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 00000000000..f32bb22a7cc
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,37 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "postgres.h"
+
+#include "storage/fd.h"
+
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern uint64 FsyncAtCheckpoint(const SmgrFileTag *tag,
+ File file,
+ uint64 last_cycle);
+extern void RememberFsyncRequest(const SmgrFileTag *tag,
+ int fd,
+ uint64 open_seq);
+extern void SetForwardFsyncRequests(void);
+
+
+#endif
--
2.19.1
0002-Fix-deadlock-by-sending-without-content-lock-but--v3.patch (application/octet-stream)
From 9a3bdf2bf08bf617432135da306e0002fdeedba3 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 23 Nov 2018 17:13:56 +1300
Subject: [PATCH 2/2] Fix deadlock by sending without content lock, but still
marked BM_DIRTY.
---
src/backend/postmaster/checkpointer.c | 32 +++++++++++++++++----------
src/backend/storage/buffer/bufmgr.c | 23 +++++++++++++++++++
src/backend/storage/smgr/md.c | 19 ++++++++++++++++
src/backend/storage/smgr/smgr.c | 14 +++++++++++-
src/include/storage/smgr.h | 6 ++++-
5 files changed, 80 insertions(+), 14 deletions(-)
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 892654dc053..65e7dde5760 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -111,7 +111,7 @@ typedef struct
uint32 type;
SmgrFileTag tag;
bool contains_fd;
- int ckpt_started;
+ uint64 sync_cycle;
uint64 open_seq;
/* might add a real request-type field later; not needed yet */
} CheckpointerRequest;
@@ -175,7 +175,7 @@ static bool IsCheckpointOnSchedule(double progress);
static bool ImmediateCheckpointRequested(void);
static void UpdateSharedMemoryConfig(void);
static void SendFsyncRequest(CheckpointerRequest *request, int fd);
-static bool AbsorbFsyncRequest(bool stop_at_current_cycle);
+static bool AbsorbFsyncRequest(uint64 max_cycle);
/* Signal handlers */
@@ -1143,13 +1143,10 @@ ForwardFsyncRequest(const SmgrFileTag *tag, File file)
request.open_seq = request.contains_fd ? FileGetOpenSeq(file) : (uint64) -1;
/*
- * We read ckpt_started without synchronization. It is used to prevent
- * AbsorbAllFsyncRequests() from reading new values from after a
- * checkpoint began. A slightly out-of-date value here will only cause
- * it to do a little bit more work than strictly necessary, but that's
- * OK.
+ * Include the current sync cycle. This is used to prevent
+ * AbsorbAllFsyncRequests() from consuming messages sent after it began.
*/
- request.ckpt_started = CheckpointerShmem->ckpt_started;
+ request.sync_cycle = GetCheckpointSyncCycle();
SendFsyncRequest(&request,
request.contains_fd ? FileGetRawDesc(file) : -1);
@@ -1198,6 +1195,8 @@ AbsorbFsyncRequests(void)
void
AbsorbAllFsyncRequests(void)
{
+ uint64 max_cycle;
+
if (!AmCheckpointerProcess())
return;
@@ -1207,12 +1206,20 @@ AbsorbAllFsyncRequests(void)
BgWriterStats.m_buf_fsync_backend +=
pg_atomic_exchange_u32(&CheckpointerShmem->num_backend_fsync, 0);
+ /*
+ * The highest cycle number we normally expect to see is the current cycle
+ * number. Even though we only want to consume messages from the previous
+ * cycle, they may be hiding behind other messages, so we consume until
+ * the pipe is empty or until we see a future cycle (caused by running out
+ * of fds while we're in the loop).
+ */
+ max_cycle = GetCheckpointSyncCycle();
for (;;)
{
if (!FlushFsyncRequestQueueIfNecessary())
elog(FATAL, "may not happen");
- if (!AbsorbFsyncRequest(true))
+ if (!AbsorbFsyncRequest(max_cycle))
break;
}
}
@@ -1220,9 +1227,11 @@ AbsorbAllFsyncRequests(void)
/*
* AbsorbFsyncRequest
* Retrieve one queued fsync request and pass it to the local smgr.
+ * Return false if there is nothing to absorb or we see a message
+ * with a sync cycle higher than max_cycle.
*/
static bool
-AbsorbFsyncRequest(bool stop_at_current_cycle)
+AbsorbFsyncRequest(uint64 max_cycle)
{
static CheckpointerRequest req;
int fd = -1;
@@ -1306,8 +1315,7 @@ AbsorbFsyncRequest(bool stop_at_current_cycle)
RememberFsyncRequest(&req.tag, fd, req.open_seq);
END_CRIT_SECTION();
- if (stop_at_current_cycle &&
- req.ckpt_started == CheckpointerShmem->ckpt_started)
+ if (req.sync_cycle > max_cycle)
return false;
return true;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 256cc5e0217..8a73d4fb384 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -178,6 +178,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
static inline int32 GetPrivateRefCount(Buffer buffer);
static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void ScheduleBufferTagForFsync(const BufferTag *tag, SMgrRelation reln);
/*
* Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -1142,6 +1143,7 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
FlushBuffer(buf, NULL);
LWLockRelease(BufferDescriptorGetContentLock(buf));
+ ScheduleBufferTagForFsync(&buf->tag, NULL);
ScheduleBufferTagForWriteback(&BackendWritebackContext,
&buf->tag);
@@ -2401,6 +2403,7 @@ SyncOneBuffer(int buf_id, bool skip_recently_used, WritebackContext *wb_context)
UnpinBuffer(bufHdr, true);
+ ScheduleBufferTagForFsync(&tag, NULL);
ScheduleBufferTagForWriteback(wb_context, &tag);
return result | BUF_WRITTEN;
@@ -2662,6 +2665,11 @@ BufferGetTag(Buffer buffer, RelFileNode *rnode, ForkNumber *forknum,
*
* If the caller has an smgr reference for the buffer's relation, pass it
* as the second parameter. If not, pass NULL.
+ *
+ * The caller must call ScheduleBufferTagForFsync() after releasing the
+ * content lock, but before clearing the BM_DIRTY flag. This ensures that a
+ * concurrent checkpoint will either receive the fsync request, or consider it
+ * dirty and flush it (again) itself.
*/
static void
FlushBuffer(BufferDesc *buf, SMgrRelation reln)
@@ -3223,6 +3231,7 @@ FlushRelationBuffers(Relation rel)
FlushBuffer(bufHdr, rel->rd_smgr);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
+ ScheduleBufferTagForFsync(&bufHdr->tag, rel->rd_smgr);
}
else
UnlockBufHdr(bufHdr, buf_state);
@@ -3277,6 +3286,7 @@ FlushDatabaseBuffers(Oid dbid)
FlushBuffer(bufHdr, NULL);
LWLockRelease(BufferDescriptorGetContentLock(bufHdr));
UnpinBuffer(bufHdr, true);
+ ScheduleBufferTagForFsync(&bufHdr->tag, NULL);
}
else
UnlockBufHdr(bufHdr, buf_state);
@@ -4239,6 +4249,19 @@ WritebackContextInit(WritebackContext *context, int *max_pending)
context->nr_pending = 0;
}
+/*
+ * Register a block that is dirty in the kernel page cache, for later fsync.
+ */
+static void
+ScheduleBufferTagForFsync(const BufferTag *tag, SMgrRelation reln)
+{
+ /* Open if not already passed in. */
+ if (reln == NULL)
+ reln = smgropen(tag->rnode, InvalidBackendId);
+
+ smgrregdirtyblock(reln, tag->forkNum, tag->blockNum);
+}
+
/*
* Add buffer to list of pending writeback requests.
*/
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 344e0e12d6f..56c4d15fa60 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -967,6 +967,25 @@ mdpath(const SmgrFileTag *tag, char *out)
pfree(path);
}
+/*
+ * Register the file behind a dirty block for syncing.
+ */
+void
+mdregdirtyblock(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ MdfdVec *seg;
+ int segno;
+
+ /* Find the segment. */
+ segno = blocknum / RELSEG_SIZE;
+ if (segno >= reln->md_num_open_segs[forknum])
+ elog(ERROR, "block number past end of relation");
+ seg = &reln->md_seg_fds[forknum][segno];
+
+ /* Register it as dirty. */
+ register_dirty_segment(reln, forknum, seg);
+}
+
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index c36ba4298b7..95794b3a945 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -61,6 +61,8 @@ typedef struct f_smgr
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_path) (const SmgrFileTag *tag, char *out);
+ void (*smgr_regdirtyblock) (SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
} f_smgr;
@@ -81,7 +83,8 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_path = mdpath
+ .smgr_path = mdpath,
+ .smgr_regdirtyblock = mdregdirtyblock
}
};
@@ -768,6 +771,15 @@ smgrpath(const SmgrFileTag *tag, char *out)
smgrsw[which_for_file_tag(tag)].smgr_path(tag, out);
}
+/*
+ * smgrregdirtyblock() -- Register a dirty block for later fsync.
+ */
+void
+smgrregdirtyblock(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+ smgrsw[reln->smgr_which].smgr_regdirtyblock(reln, forknum, blocknum);
+}
+
/*
* AtEOXact_SMgr
*
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index dc22efbe0a8..c8afb6ee1ca 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -118,7 +118,9 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpath(const SmgrFileTag *tag, char *out);
+extern void smgrpath(const SmgrFileTag *file_tag, char *out);
+extern void smgrregdirtyblock(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern void AtEOXact_SMgr(void);
@@ -145,6 +147,8 @@ extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
extern void mdpath(const SmgrFileTag *tag, char *out);
+extern void mdregdirtyblock(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
extern bool FlushFsyncRequestQueueIfNecessary(void);
extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
--
2.19.1
On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I do have a new plan though...
Ugh. The plan in my previous email doesn't work; I was confused about
the timing of the buffer header update. Back to the drawing board.
--
Thomas Munro
http://www.enterprisedb.com
On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I do have a new plan though...
Ugh. The plan in my previous email doesn't work; I was confused about
the timing of the buffer header update. Back to the drawing board.
Any chance to share the drawing board with the ideas? :)
On a serious note, I assume you have plans to work on this during the next
CF, right?
On Sun, Dec 2, 2018 at 1:46 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
On Mon, Nov 26, 2018 at 11:47 PM Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Nov 23, 2018 at 5:45 PM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
I do have a new plan though...
Ugh. The plan in my previous email doesn't work; I was confused about
the timing of the buffer header update. Back to the drawing board.
Any chance to share the drawing board with the ideas? :)
On a serious note, I assume you have plans to work on this during the next
CF, right?
Indeed I am. Unfortunately, the solution to that deadlock eludes me still.
So, I have split this work into multiple patches. 0001 is a draft
version of some new infrastructure I'd like to propose, 0002 is the
thing originally described by the first two paragraphs in the first
email in this thread, and the rest I'll have to defer for now (the fd
passing stuff).
To restate the purpose of this work: I want to make it possible for
other patches to teach the checkpointer to fsync new kinds of files
that are accessed through the buffer pool. Specifically, undo segment
files (for zheap) and SLRU files (see Shawn Debnath's plan to put clog
et al into the standard buffer pool). The main changes are:
1. A bunch of stuff moved out of md.c into smgrsync.c, where the same
pendingOpTable machinery can be shared by any block storage
implementation.
2. The actual fsync'ing now happens by going through smgrimmedsync().
3. You can now tell the checkpointer to forget individual segments
(undo and slru both need to be able to do that when they truncate data
from the 'front').
4. The protocol for forgetting relations etc is slightly different:
if a file is found to be missing, call AbsorbFsyncRequests() and then
probe to see whether the segment number has disappeared from the set
(instead of using cancel flags), though I still need to test this
case.
5. Requests (ie segment numbers) are now stored in a sorted vector,
because it doesn't make sense to store large and potentially sparse
integers in bitmapsets. See patch 0001 for the new machinery to
support that, and the sketch just below for how it might be
instantiated.
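To give an idea of the intended usage, a minimal instantiation for
segment numbers might look like the following. This is only an
illustrative sketch following the usage notes in the attached 0001
patch -- segnum_vector, segnum_*, pending_segnos and the choice of
BlockNumber as the element type are made-up names, not code from the
patches:

#include "postgres.h"
#include "storage/block.h"

/* Generate a vector type specialized for segment numbers. */
#define SV_PREFIX segnum_vector
#define SV_ELEMENT_TYPE BlockNumber
#define SV_SCOPE static inline
#define SV_DECLARE
#define SV_DEFINE
#include "lib/simplevector.h"

/* Generate sorting/searching support for the same element type. */
#define SA_PREFIX segnum
#define SA_ELEMENT_TYPE BlockNumber
#define SA_SCOPE static inline
#define SA_DECLARE
#define SA_DEFINE
#define SA_COMPARE(a, b) (*(a) < *(b) ? -1 : (*(a) > *(b) ? 1 : 0))
#include "lib/simplealgo.h"

static segnum_vector pending_segnos;	/* zero-initialized, ie empty */

/* Record one dirty segment number; duplicates are removed lazily. */
static void
remember_segno(BlockNumber segno)
{
	segnum_vector_append(&pending_segnos, &segno);
}

/* Sort and de-duplicate before scanning at checkpoint time. */
static void
canonicalize_segnos(void)
{
	BlockNumber *first = segnum_vector_begin(&pending_segnos);
	BlockNumber *last = segnum_vector_end(&pending_segnos);

	segnum_sort(first, last);
	last = segnum_unique(first, last);
	segnum_vector_resize(&pending_segnos, last - first);
}

Once the set is kept sorted like this, membership tests can use
segnum_binary_search() and range scans can start at
segnum_lower_bound(), which is the point of the sorted representation.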
The interfaces in 0001 are perhaps a bit verbose (and hard to fit in
80 columns). Maybe I need something better for memory contexts.
Speaking of which, it wasn't possible to do a guaranteed-no-alloc
merge (like the one done for zero-anchored bitmapsets in commit
1556cb2fc), so I had to add a second vector for 'in progress'
segments. I merge them with the main set on the next attempt, if it's
found to be non-empty, roughly as sketched below. Very open to better
ideas on how to do any of this.
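For what it's worth, the "merge on next attempt" scheme could look
roughly like this, reusing the hypothetical segnum_vector/segnum
helpers from the sketch above (again an illustration of the idea, not
the patch's actual code):

static segnum_vector segnos;		/* main sorted, de-duplicated set */
static segnum_vector in_progress_segnos;	/* left over from a failed pass */

static void
merge_in_progress_segnos(void)
{
	if (segnum_vector_empty(&in_progress_segnos))
		return;

	/* This may allocate, which is why it is deferred to the next attempt. */
	segnum_vector_append_n(&segnos,
						   segnum_vector_data(&in_progress_segnos),
						   segnum_vector_size(&in_progress_segnos));
	segnum_sort(segnum_vector_begin(&segnos),
				segnum_vector_end(&segnos));
	segnum_vector_resize(&segnos,
						 segnum_unique(segnum_vector_begin(&segnos),
									   segnum_vector_end(&segnos)) -
						 segnum_vector_begin(&segnos));
	segnum_vector_clear(&in_progress_segnos);
}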
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Add-parameterized-vectors-and-sorting-searching-s-v4.patch (application/octet-stream)
From 63ce898e6db0b6eeb545e8b4f10aded1ba283e63 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Tue, 1 Jan 2019 07:05:46 +1300
Subject: [PATCH 1/2] Add parameterized vectors and sorting/searching support.
To make it a bit easier to work with arrays (rather than lists or
bitmaps), create a mechanism along the lines of StringInfo, but usable
with other types (eg ints, structs, ...) that can be parameterized at
compile time. Follow the example of simplehash.h.
Provide some simple sorting and searching algorithms for working with
sorted vectors.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/include/lib/simplealgo.h | 171 +++++++++++++
src/include/lib/simplevector.h | 454 +++++++++++++++++++++++++++++++++
2 files changed, 625 insertions(+)
create mode 100644 src/include/lib/simplealgo.h
create mode 100644 src/include/lib/simplevector.h
diff --git a/src/include/lib/simplealgo.h b/src/include/lib/simplealgo.h
new file mode 100644
index 00000000000..b9b5b248aef
--- /dev/null
+++ b/src/include/lib/simplealgo.h
@@ -0,0 +1,171 @@
+/*-------------------------------------------------------------------------
+ *
+ * simplealgo.h
+ *
+ * Simple algorithms specialized for arrays of user-defined types. For
+ * now, functions related to sorting and searching.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ * To generate functions specialized for a type, the following parameter
+ * macros should be #define'd before this file is included.
+ *
+ * - SA_PREFIX - prefix for all symbol names generated.
+ * - SA_ELEMENT_TYPE - type of the referenced elements
+ * - SA_DECLARE - if defined the functions and types are declared
+ * - SA_DEFINE - if defined the functions and types are defined
+ * - SA_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ * The following are relevant only when SA_DEFINE is defined:
+ *
+ * - SA_COMPARE(a, b) - an expression to compare pointers to two values
+ *
+ * IDENTIFICATION
+ * src/include/lib/simplealgo.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define SA_MAKE_PREFIX(a) CppConcat(a,_)
+#define SA_MAKE_NAME(name) SA_MAKE_NAME_(SA_MAKE_PREFIX(SA_PREFIX),name)
+#define SA_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define SA_SORT SA_MAKE_NAME(sort)
+#define SA_UNIQUE SA_MAKE_NAME(unique)
+#define SA_BINARY_SEARCH SA_MAKE_NAME(binary_search)
+#define SA_LOWER_BOUND SA_MAKE_NAME(lower_bound)
+
+#ifdef SA_DECLARE
+
+SA_SCOPE void SA_SORT(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last);
+SA_SCOPE SA_ELEMENT_TYPE *SA_UNIQUE(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last);
+SA_SCOPE bool SA_BINARY_SEARCH(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value);
+SA_SCOPE SA_ELEMENT_TYPE *SA_LOWER_BOUND(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value);
+
+#endif
+
+#ifdef SA_DEFINE
+
+/* helper functions */
+#define SA_QSORT_COMPARATOR SA_MAKE_NAME(qsort_comparator)
+
+/*
+ * Function wrapper for comparator expression.
+ */
+static inline int
+SA_QSORT_COMPARATOR(const void *a, const void *b)
+{
+ return SA_COMPARE((SA_ELEMENT_TYPE *) a, (SA_ELEMENT_TYPE *) b);
+}
+
+/*
+ * Sort an array [first, last) in place.
+ */
+SA_SCOPE void
+SA_SORT(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last)
+{
+ qsort(first, last - first, sizeof(SA_ELEMENT_TYPE), SA_QSORT_COMPARATOR);
+}
+
+/*
+ * Remove duplicates from an array [first, last). Return the new last pointer
+ * (ie one past the new end).
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_UNIQUE(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last)
+{
+ SA_ELEMENT_TYPE *write_head;
+ SA_ELEMENT_TYPE *read_head;
+
+ if (last - first <= 1)
+ return last;
+
+ write_head = first;
+ read_head = first + 1;
+
+ while (read_head < last)
+ {
+ if (SA_COMPARE(read_head, write_head) != 0)
+ *++write_head = *read_head;
+ ++read_head;
+ }
+ return write_head + 1;
+}
+
+/*
+ * Check if a sorted array [first, last) contains a value.
+ */
+SA_SCOPE bool
+SA_BINARY_SEARCH(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value)
+{
+ SA_ELEMENT_TYPE *lower = first;
+ SA_ELEMENT_TYPE *upper = last - 1;
+
+ while (lower <= upper)
+ {
+ SA_ELEMENT_TYPE *mid;
+ int cmp;
+
+ mid = lower + (upper - lower) / 2;
+ cmp = SA_COMPARE(mid, value);
+ if (cmp < 0)
+ lower = mid + 1;
+ else if (cmp > 0)
+ upper = mid - 1;
+ else
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Find the first element in the sorted range [first, last) that is not less
+ * than value. Returns last if there is no such element.
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_LOWER_BOUND(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value)
+{
+ while (first < last)
+ {
+ SA_ELEMENT_TYPE *mid = first + (last - first) / 2;
+
+ if (SA_COMPARE(mid, value) < 0)
+ first = mid + 1;
+ else
+ last = mid;
+ }
+
+ return first;
+}
+
+#endif
+
+#undef SA_MAKE_PREFIX
+#undef SA_MAKE_NAME
+#undef SA_MAKE_NAME_
+#undef SA_SORT
+#undef SA_UNIQUE
+#undef SA_BINARY_SEARCH
+#undef SA_LOWER_BOUND
+#undef SA_DECLARE
+#undef SA_DEFINE
diff --git a/src/include/lib/simplevector.h b/src/include/lib/simplevector.h
new file mode 100644
index 00000000000..ac4c1a01bcf
--- /dev/null
+++ b/src/include/lib/simplevector.h
@@ -0,0 +1,454 @@
+/*-------------------------------------------------------------------------
+ *
+ * simplevector.h
+ *
+ * Vector implementation that will be specialized for user-defined types,
+ * by including this file to generate the required code. Suitable for
+ * value types that can be bitwise copied and moved. Includes an in-place
+ * small-vector optimization, so that allocation can be avoided until the
+ * internal space is exceeded.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ * To generate a type and associated functions, the following parameter
+ * macros should be #define'd before this file is included.
+ *
+ * - SV_PREFIX - prefix for all symbol names generated.
+ * - SV_ELEMENT_TYPE - type of the contained elements
+ * - SV_DECLARE - if defined the functions and types are declared
+ * - SV_DEFINE - if defined the functions and types are defined
+ * - SV_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ * IDENTIFICATION
+ * src/include/lib/simplevector.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* helpers */
+#define SV_MAKE_PREFIX(a) CppConcat(a,_)
+#define SV_MAKE_NAME(name) SV_MAKE_NAME_(SV_MAKE_PREFIX(SV_PREFIX),name)
+#define SV_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define SV_TYPE SV_PREFIX
+
+/* function declarations */
+#define SV_INIT SV_MAKE_NAME(init)
+#define SV_DESTROY SV_MAKE_NAME(destroy)
+#define SV_RESET SV_MAKE_NAME(reset)
+#define SV_CLEAR SV_MAKE_NAME(clear)
+#define SV_DATA SV_MAKE_NAME(data)
+#define SV_EMPTY SV_MAKE_NAME(empty)
+#define SV_SIZE SV_MAKE_NAME(size)
+#define SV_RESIZE SV_MAKE_NAME(resize)
+#define SV_CAPACITY SV_MAKE_NAME(capacity)
+#define SV_RESERVE SV_MAKE_NAME(reserve)
+#define SV_APPEND SV_MAKE_NAME(append)
+#define SV_APPEND_N SV_MAKE_NAME(append_n)
+#define SV_INSERT SV_MAKE_NAME(insert)
+#define SV_INSERT_N SV_MAKE_NAME(insert_n)
+#define SV_ERASE SV_MAKE_NAME(erase)
+#define SV_ERASE_N SV_MAKE_NAME(erase_n)
+#define SV_BEGIN SV_MAKE_NAME(begin)
+#define SV_END SV_MAKE_NAME(end)
+#define SV_BACK SV_MAKE_NAME(back)
+#define SV_POP_BACK SV_MAKE_NAME(pop_back)
+#define SV_SWAP SV_MAKE_NAME(swap)
+
+#ifndef SV_IN_PLACE_CAPACITY
+#define SV_IN_PLACE_CAPACITY 3
+#endif
+
+#ifdef SV_DECLARE
+
+typedef struct SV_TYPE
+{
+ /*
+ * If size is <= SV_IN_PLACE_CAPACITY, then it represents the number of
+ * elements stored in u.elements. Otherwise, it is the capacity of the
+ * buffer in u.overflow.data (in number of potential elements), and
+ * u.overflow.count represents the number of occupied elements.
+ */
+ uint32 size;
+ union
+ {
+ struct
+ {
+ void *data;
+ uint32 count;
+ } overflow;
+ SV_ELEMENT_TYPE elements[SV_IN_PLACE_CAPACITY];
+ } u;
+} SV_TYPE;
+
+/* externally visible function prototypes */
+SV_SCOPE void SV_INIT(SV_TYPE *vec);
+SV_SCOPE void SV_DESTROY(SV_TYPE *vec);
+SV_SCOPE void SV_RESET(SV_TYPE *vec);
+SV_SCOPE void SV_CLEAR(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_DATA(SV_TYPE *vec);
+SV_SCOPE bool SV_EMPTY(SV_TYPE *vec);
+SV_SCOPE uint32 SV_SIZE(SV_TYPE *vec);
+SV_SCOPE void SV_RESIZE(SV_TYPE *vec, uint32 size);
+SV_SCOPE uint32 SV_CAPACITY(SV_TYPE *vec);
+SV_SCOPE void SV_RESERVE(SV_TYPE *vec, uint32 capacity);
+SV_SCOPE void SV_APPEND(SV_TYPE *vec, const SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_APPEND_N(SV_TYPE *vec, const SV_ELEMENT_TYPE *values,
+ uint32 size);
+SV_SCOPE void SV_INSERT(SV_TYPE *vec,
+ SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_INSERT_N(SV_TYPE *vec,
+ SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *values,
+ uint32 n);
+SV_SCOPE void SV_ERASE(SV_TYPE *vec, SV_ELEMENT_TYPE *position);
+SV_SCOPE void SV_ERASE_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position, uint32 n);
+SV_SCOPE void SV_SWAP(SV_TYPE *a, SV_TYPE *b);
+SV_SCOPE SV_ELEMENT_TYPE *SV_BEGIN(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_END(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_BACK(SV_TYPE *vec);
+SV_SCOPE void SV_POP_BACK(SV_TYPE *vec);
+
+#ifdef SV_COMPARE
+SV_SCOPE bool SV_INSERT_SORTED(SV_TYPE *vec, SV_ELEMENT_TYPE *value);
+SV_SCOPE bool SV_LOWER_BOUND(SV_TYPE *vec, SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_SORT(SV_TYPE *vec);
+SV_SCOPE void SV_UNIQUE(SV_TYPE *vec);
+#endif
+
+#endif
+
+#ifdef SV_DEFINE
+
+/*
+ * Initialize a vector in-place.
+ */
+SV_SCOPE void
+SV_INIT(SV_TYPE *vec)
+{
+ vec->size = 0;
+}
+
+/*
+ * Free any resources owned by the vector.
+ */
+SV_SCOPE void
+SV_DESTROY(SV_TYPE *vec)
+{
+ SV_RESET(vec);
+}
+
+/*
+ * Free any resources owned by the vector.
+ */
+SV_SCOPE void
+SV_RESET(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ pfree(vec->u.overflow.data);
+ vec->size = 0;
+}
+
+/*
+ * Clear the vector so that it contains no elements.
+ */
+SV_SCOPE void
+SV_CLEAR(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count = 0;
+ else
+ vec->size = 0;
+}
+
+/*
+ * Return a pointer to the elements in the vector.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_DATA(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->u.overflow.data;
+ else
+ return &vec->u.elements[0];
+}
+
+/*
+ * Check if the vector is empty (has no elements).
+ */
+SV_SCOPE bool
+SV_EMPTY(SV_TYPE *vec)
+{
+ return SV_SIZE(vec) == 0;
+}
+
+/*
+ * Return the number of elements in the vector.
+ */
+SV_SCOPE uint32
+SV_SIZE(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->u.overflow.count;
+ else
+ return vec->size;
+}
+
+/*
+ * Resize the vector, discarding elements at the end, or creating new
+ * zero-initialized elements as required.
+ */
+SV_SCOPE void
+SV_RESIZE(SV_TYPE *vec, uint32 size)
+{
+ uint32 old_size = SV_SIZE(vec);
+
+ /* Growing? */
+ if (size > old_size)
+ {
+ SV_RESERVE(vec, size);
+ memset(&SV_DATA(vec)[old_size], 0,
+ sizeof(SV_ELEMENT_TYPE) * (size - old_size));
+ }
+
+ /* Set the new size. */
+ if (vec->size <= SV_IN_PLACE_CAPACITY)
+ vec->size = size;
+ else
+ vec->u.overflow.count = size;
+}
+
+/*
+ * Return the number of elements that can be held in the vector before it
+ * needs to reallocate.
+ */
+SV_SCOPE uint32
+SV_CAPACITY(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->size;
+ else
+ return SV_IN_PLACE_CAPACITY;
+}
+
+/*
+ * Make sure we have capacity for a given number of elements without having to
+ * reallocate.
+ */
+SV_SCOPE void
+SV_RESERVE(SV_TYPE *vec, uint32 capacity)
+{
+ void *new_buffer;
+
+ /* Do nothing if we already have that much capacity. */
+ if (capacity <= SV_IN_PLACE_CAPACITY || capacity < vec->size)
+ return;
+
+ /* Allocate larger buffer. */
+#ifdef SV_GLOBAL_MEMORY_CONTEXT
+ new_buffer = MemoryContextAlloc(SV_GLOBAL_MEMORY_CONTEXT,
+ sizeof(SV_ELEMENT_TYPE) * capacity);
+#else
+ new_buffer = palloc(sizeof(SV_ELEMENT_TYPE) * capacity);
+#endif
+
+ /* Copy existing data to new buffer. */
+ if (vec->size <= SV_IN_PLACE_CAPACITY)
+ {
+ /* Promote from in-line format. */
+ if (vec->size > 0)
+ memcpy(new_buffer,
+ vec->u.elements,
+ sizeof(SV_ELEMENT_TYPE) * vec->size);
+ vec->u.overflow.count = vec->size;
+ }
+ else
+ {
+ /* Copy from existing smaller overflow buffer, and free it. */
+ if (vec->u.overflow.count > 0)
+ memcpy(new_buffer,
+ vec->u.overflow.data,
+ sizeof(SV_ELEMENT_TYPE) * vec->u.overflow.count);
+ Assert(vec->u.overflow.data);
+ pfree(vec->u.overflow.data);
+ }
+ vec->u.overflow.data = new_buffer;
+ vec->size = capacity;
+}
+
+/*
+ * Append a value to the end of a vector.
+ */
+SV_SCOPE void
+SV_APPEND(SV_TYPE *vec, const SV_ELEMENT_TYPE *value)
+{
+ SV_APPEND_N(vec, value, 1);
+}
+
+/*
+ * Append N values to the end of a vector.
+ */
+SV_SCOPE void
+SV_APPEND_N(SV_TYPE *vec, const SV_ELEMENT_TYPE *values, uint32 n)
+{
+ uint32 size = SV_SIZE(vec);
+
+ SV_RESERVE(vec, size + n);
+ memcpy(&SV_DATA(vec)[size], values, sizeof(SV_ELEMENT_TYPE) * n);
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count += n;
+ else
+ vec->size += n;
+}
+
+/*
+ * Insert a value before an arbitrary position in the vector. This is not
+ * especially efficient as it must shift values to make space.
+ */
+SV_SCOPE void
+SV_INSERT(SV_TYPE *vec, SV_ELEMENT_TYPE *position, const SV_ELEMENT_TYPE *value)
+{
+ SV_INSERT_N(vec, position, value, 1);
+}
+
+/*
+ * Insert N values before an arbitrary position in the vector. This is not
+ * especially efficient as it must shift values to make space.
+ */
+SV_SCOPE void
+SV_INSERT_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *values, uint32 n)
+{
+ uint32 size = SV_SIZE(vec);
+ uint32 i = position - SV_DATA(vec);
+ SV_ELEMENT_TYPE *data;
+
+ if (n == 0)
+ return;
+
+ Assert(position >= SV_DATA(vec) &&
+ position <= SV_DATA(vec) + size);
+ SV_RESERVE(vec, size + n);
+ data = SV_DATA(vec);
+ memmove(&data[i + n],
+ &data[i],
+ sizeof(SV_ELEMENT_TYPE) * (size - i));
+ memcpy(&data[i], values, sizeof(SV_ELEMENT_TYPE) * n);
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count += n;
+ else
+ vec->size += n;
+}
+
+/*
+ * Erase an arbitrary element in the vector. This is not especially
+ * efficient as it must shift trailing values.
+ */
+SV_SCOPE void
+SV_ERASE(SV_TYPE *vec, SV_ELEMENT_TYPE *position)
+{
+ SV_ERASE_N(vec, position, 1);
+}
+
+/*
+ * Erase N values beginning with an arbitrary element in the vector. This is
+ * not especially efficient as it must shift trailing values.
+ */
+SV_SCOPE void
+SV_ERASE_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position, uint32 n)
+{
+ Assert(position >= SV_DATA(vec) &&
+ position + n <= SV_DATA(vec) + SV_SIZE(vec));
+ /* Shift down only the elements that follow the erased range. */
+ memmove(position,
+ position + n,
+ sizeof(SV_ELEMENT_TYPE) * ((SV_DATA(vec) + SV_SIZE(vec)) - (position + n)));
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count -= n;
+ else
+ vec->size -= n;
+}
+
+/*
+ * Get a pointer to the first element, if there is one.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_BEGIN(SV_TYPE *vec)
+{
+ return SV_DATA(vec);
+}
+
+/*
+ * Get a pointer to the element past the last element.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_END(SV_TYPE *vec)
+{
+ return SV_DATA(vec) + SV_SIZE(vec);
+}
+
+/*
+ * Get a pointer to the back (last) element.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_BACK(SV_TYPE *vec)
+{
+ Assert(!SV_EMPTY(vec));
+ return SV_DATA(vec) + SV_SIZE(vec) - 1;
+}
+
+/*
+ * Remove the back (last) element.
+ */
+SV_SCOPE void
+SV_POP_BACK(SV_TYPE *vec)
+{
+ Assert(!SV_EMPTY(vec));
+ SV_RESIZE(vec, SV_SIZE(vec) - 1);
+}
+
+/*
+ * Swap the contents of two vectors.
+ */
+SV_SCOPE void
+SV_SWAP(SV_TYPE *a, SV_TYPE *b)
+{
+ SV_TYPE tmp;
+
+ tmp = *a;
+ *a = *b;
+ *b = tmp;
+}
+
+#endif
+
+#undef SV_MAKE_PREFIX
+#undef SV_MAKE_NAME
+#undef SV_MAKE_NAME_
+#undef SV_INIT
+#undef SV_DESTROY
+#undef SV_RESET
+#undef SV_CLEAR
+#undef SV_DATA
+#undef SV_EMPTY
+#undef SV_SIZE
+#undef SV_RESIZE
+#undef SV_CAPACITY
+#undef SV_RESERVE
+#undef SV_APPEND
+#undef SV_APPEND_N
+#undef SV_INSERT
+#undef SV_INSERT_N
+#undef SV_ERASE
+#undef SV_ERASE_N
+#undef SV_BEGIN
+#undef SV_END
+#undef SV_BACK
+#undef SV_POP_BACK
+#undef SV_SWAP
+#undef SV_IN_PLACE_CAPACITY
+#undef SV_DECLARE
+#undef SV_DEFINE
--
2.19.1
0002-Refactor-the-fsync-machinery-to-support-future-SM-v4.patch (application/octet-stream)
From afd2eb044eacf07601e711f2d292c66ecb8bb134 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 31 Dec 2018 15:25:16 +1300
Subject: [PATCH 2/2] Refactor the fsync machinery to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting fsyncs out of md.c into a new
translation unit smgrsync.c.
2. Have smgrsync.c perform the actual fsync() calls via the existing
polymorphic smgrimmedsync() interface, extended to allow an individual
segment number to be specified.
3. Teach the checkpointer how to forget individual segments that are
unlinked from the 'front' after having been dropped from shared
buffers.
4. Move the request tracking from a bitmapset into a sorted vector,
because the proposed block storage managers are not anchored at zero
and use potentially very large and sparse integers.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/xlog.c | 2 +
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/catalog/heap.c | 2 +-
src/backend/commands/dbcommands.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 21 +-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/ipc/ipci.c | 1 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 801 ++-----------------------
src/backend/storage/smgr/smgr.c | 104 ++--
src/backend/storage/smgr/smgrsync.c | 834 ++++++++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/storage/smgr.h | 29 +-
src/include/storage/smgrsync.h | 36 ++
23 files changed, 992 insertions(+), 887 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/storage/smgrsync.h
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 96501456422..f3d53bb47dd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9361,7 +9361,7 @@ heap_sync(Relation rel)
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
/* FSM is not critical, don't bother syncing it */
@@ -9372,7 +9372,7 @@ heap_sync(Relation rel)
toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
heap_close(toastrel, AccessShareLock);
}
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1e..a0f957d1ef4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -178,7 +178,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f57557776..a829c9cc034 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1207,7 +1207,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (RelationNeedsWAL(wstate->index))
{
RelationOpenSmgr(wstate->index);
- smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
}
}
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 7dd0d61fbbc..7201b6533f3 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -205,7 +205,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a28be4f7db8..23e840ede2b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/backend_random.h"
#include "utils/builtins.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index fc1927c537b..f04cb86d650 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4d5b82aaa95..7927b353fcf 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1405,7 +1405,7 @@ heap_create_init_fork(Relation rel)
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
- smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f640f469729..b59414b3350 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c8c50e8c989..a5f19eaf3f0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11300,7 +11300,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
* here, they might still not be on disk when the crash occurs.
*/
if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ smgrimmedsync(dst, forkNum, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4a714f6e2be..aa76b8d25ec 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 7612b17b442..b37a25fc2a6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b9c118e1560..f420fce60dd 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -47,6 +47,8 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -108,10 +110,10 @@
*/
typedef struct
{
- RelFileNode rnode;
+ int type;
+ RelFileNode rnode;
ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ SegmentNumber segno;
} CheckpointerRequest;
typedef struct
@@ -1077,9 +1079,7 @@ RequestCheckpoint(int flags)
* RelFileNodeBackend.
*
* segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
+ * fsync'd.
*
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
@@ -1092,13 +1092,14 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
CheckpointerRequest *request;
bool too_full;
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
@@ -1130,6 +1131,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ request->type = type;
request->rnode = rnode;
request->forknum = forknum;
request->segno = segno;
@@ -1314,7 +1316,8 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberFsyncRequest(request->type, request->rnode, request->forknum,
+ request->segno);
END_CRIT_SECTION();
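The queue-full contract is visible in the hunks above: rather than
blocking, ForwardFsyncRequest() reports failure with a false return, and
the backend must then do the work itself. A sketch of that fallback,
modeled on register_dirty_segment() in md.c:

    /* Try to hand the fsync off to the checkpointer ... */
    if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
    {
        /* ... and if the queue is full, fsync the file ourselves. */
        if (FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
            ereport(data_sync_elevel(ERROR),
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m",
                            FilePathName(seg->mdfd_vfd))));
    }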
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9817770affc..52c4801ddf4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c03..4531e4dc4f7 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..c9c4be325ed 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4c6a50509f8..114963ff42a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -134,30 +121,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +150,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +173,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -388,7 +295,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -448,7 +355,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -993,423 +900,55 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
*
* Note that only writes already issued are synced; this routine knows
* nothing of dirty buffers that may exist inside the buffer manager.
+ *
+ * See smgrimmedsync comment for contract.
*/
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+bool
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int segno;
+ MdfdVec *segments;
+ size_t num_segments;
+ size_t i;
- /*
- * NOTE: mdnblocks makes sure we have opened all active segments, so that
- * fsync loop will get them all!
- */
- mdnblocks(reln, forknum);
-
- segno = reln->md_num_open_segs[forknum];
+ if (segno != InvalidSegmentNumber)
+ {
+ /*
+ * Get the specified segment, or report failure if it doesn't seem to
+ * exist.
+ */
+ segments = _mdfd_openseg(reln, forknum, segno * RELSEG_SIZE,
+ EXTENSION_RETURN_NULL);
+ if (segments == NULL)
+ return false;
+ num_segments = 1;
+ }
+ else
+ {
+ /*
+ * NOTE: mdnblocks makes sure we have opened all active segments, so that
+ * fsync loop will get them all!
+ */
+ mdnblocks(reln, forknum);
+ num_segments = reln->md_num_open_segs[forknum];
+ segments = &reln->md_seg_fds[forknum][0];
+ }
- while (segno > 0)
+ for (i = 0; i < num_segments; ++i)
{
- MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ MdfdVec *v = &segments[i];
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
FilePathName(v->mdfd_vfd))));
- segno--;
- }
-}
-
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
}
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return true;
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
@@ -1417,16 +956,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1438,258 +969,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342ef86a..42596f14e9e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -58,10 +59,8 @@ typedef struct f_smgr
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
- void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ bool (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
} f_smgr;
@@ -81,10 +80,7 @@ static const f_smgr smgrsw[] = {
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
- .smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_immedsync = mdimmedsync
}
};
@@ -104,6 +100,14 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation.
+ */
+static inline int
+which_for_relfilenode(RelFileNode rnode)
+{
+ return 0; /* we only have md.c at present */
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +122,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -185,7 +191,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
reln->smgr_vm_nblocks = InvalidBlockNumber;
- reln->smgr_which = 0; /* we only have md.c at present */
+ reln->smgr_which = which_for_relfilenode(rnode);
/* mark it not open */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
@@ -726,17 +732,20 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* smgrimmedsync() -- Force the specified relation to stable storage.
*
* Synchronously force all previous writes to the specified relation
- * down to disk.
- *
- * This is useful for building completely new relations (eg, new
- * indexes). Instead of incrementally WAL-logging the index build
- * steps, we can just write completed index pages to disk with smgrwrite
- * or smgrextend, and then fsync the completed index file before
- * committing the transaction. (This is sufficient for purposes of
- * crash recovery, since it effectively duplicates forcing a checkpoint
- * for the completed index. But it is *not* sufficient if one wishes
- * to use the WAL log for PITR or replication purposes: in that case
- * we have to make WAL entries as well.)
+ * down to disk. If segno is not InvalidSegmentNumber, this applies
+ * only to data in the given segment file.
+ *
+ * Used for checkpointing dirty files.
+ *
+ * This can also be used for building completely new relations (eg, new
+ * indexes). Instead of incrementally WAL-logging the index build steps,
+ * we can just write completed index pages to disk with smgrwrite or
+ * smgrextend, and then fsync the completed index file before committing
+ * the transaction. (This is sufficient for purposes of crash recovery,
+ * since it effectively duplicates forcing a checkpoint for the completed
+ * index. But it is *not* sufficient if one wishes to use the WAL log
+ * for PITR or replication purposes: in that case we have to make WAL
+ * entries as well.)
*
* The preceding writes should specify skipFsync = true to avoid
* duplicative fsyncs.
@@ -744,57 +753,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* Note that you need to do FlushRelationBuffers() first if there is
* any possibility that there are dirty buffers for the relation;
* otherwise the sync is not very meaningful.
+ *
+ * Failure to fsync raises an error, but non-existence of a requested
+ * segment is reported with a false return value.
*/
-void
-smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
-{
- smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
-}
-
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
+bool
+smgrimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ return smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum, segno);
}
/*
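With the sync machinery decoupled, adding a second storage manager
should later reduce to a new smgrsw[] entry plus a routing rule in
which_for_relfilenode(); a hypothetical sketch, with the undo-log names
invented purely for illustration:

    static inline int
    which_for_relfilenode(RelFileNode rnode)
    {
        if (RelFileNodeIsUndo(rnode))   /* hypothetical test */
            return 1;                   /* hypothetical undofile.c slot */
        return 0;                       /* md.c */
    }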
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 00000000000..b202acef7e1
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,834 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and startup.
+ *
+ * Originally this logic lived inside md.c, but it has been generalized
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+#define SV_PREFIX segnum_vector
+#define SV_DECLARE
+#define SV_DEFINE
+#define SV_ELEMENT_TYPE BlockNumber
+#define SV_SCOPE static inline
+#define SV_GLOBAL_MEMORY_CONTEXT pendingOpsCxt
+#include "lib/simplevector.h"
+
+#define SA_PREFIX segnum_array
+#define SA_COMPARE(a,b) (*a < *b ? -1 : *a == *b ? 0 : 1)
+#define SA_DECLARE
+#define SA_DEFINE
+#define SA_ELEMENT_TYPE SV_ELEMENT_TYPE
+#define SA_SCOPE static inline
+#include "lib/simplealgo.h"
+
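+/*
+ * The instantiations above provide segnum_vector operations (append_n,
+ * swap, pop_back, ...) and sorted-array algorithms (segnum_array_sort,
+ * segnum_array_unique, segnum_array_binary_search,
+ * segnum_array_lower_bound).  smgrsync() combines them to keep each
+ * fork's pending segment numbers as a sorted, duplicate-free vector.
+ */
+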
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. A hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+
+typedef uint32 CycleCtr; /* can be any convenient integer size */
+
+/*
+ * Values for the "type" member of CheckpointerRequest.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FSYNC_SEGMENT_REQUEST 1
+#define FORGET_SEGMENT_FSYNC 2
+#define FORGET_RELATION_FSYNC 3
+#define FORGET_DATABASE_FSYNC 4
+#define UNLINK_RELATION_REQUEST 5
+#define UNLINK_SEGMENT_REQUEST 6
+
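+/*
+ * For example, (FORGET_RELATION_FSYNC, rnode, InvalidForkNumber,
+ * InvalidSegmentNumber) cancels pending fsyncs for all forks of a
+ * relation, while (UNLINK_RELATION_REQUEST, rnode, MAIN_FORKNUM,
+ * InvalidSegmentNumber) schedules the relation's first segment file to
+ * be unlinked after the next checkpoint.
+ */
+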
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ RelFileNode rnode;
+ segnum_vector requests[MAX_FORKNUM + 1];
+ segnum_vector requests_in_progress[MAX_FORKNUM + 1];
+ CycleCtr cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static bool sync_in_progress = false;
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbFsyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ ForkNumber forknum;
+
+ entry->cycle_ctr = sync_cycle_ctr;
+
+ /*
+ * If any requests remain unprocessed, they need to be merged with
+ * the segment numbers that have arrived since.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests = &entry->requests[forknum];
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ if (!segnum_vector_empty(requests_in_progress))
+ {
+ /* Append the requests that were not yet handled. */
+ segnum_vector_append_n(requests,
+ segnum_vector_data(requests_in_progress),
+ segnum_vector_size(requests_in_progress));
+ segnum_vector_reset(requests_in_progress);
+
+ /* Sort and make unique. */
+ segnum_array_sort(segnum_vector_begin(requests),
+ segnum_vector_end(requests));
+ segnum_vector_resize(requests,
+ segnum_array_unique(segnum_vector_begin(requests),
+ segnum_vector_end(requests)) -
+ segnum_vector_begin(requests));
+ }
+ }
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ ForkNumber forknum;
+ SMgrRelation reln;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync requests, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * Scan over the forks and segments represented by the entry.
+ *
+ * The vector manipulations are slightly tricky, because we can call
+ * AbsorbFsyncRequests() inside the loop and that could result in new
+ * segment numbers being added. So we swap the contents of "requests"
+ * with "requests_in_progress", and if we fail we'll merge it with any
+ * new requests that have arrived in the meantime.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ /*
+ * Transfer the current set of segment numbers into the "in
+ * progress" vector (which must be empty initially).
+ */
+ Assert(segnum_vector_empty(requests_in_progress));
+ segnum_vector_swap(&entry->requests[forknum], requests_in_progress);
+
+ /* Loop until all requests have been handled. */
+ while (!segnum_vector_empty(requests_in_progress))
+ {
+ SegmentNumber segno = *segnum_vector_back(requests_in_progress);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ reln = smgropen(entry->rnode, InvalidBackendId);
+ if (!smgrimmedsync(reln, forknum, segno))
+ {
+ /*
+ * The underlying file couldn't be found. Check if a
+ * later message in the queue reports that it has been
+ * unlinked; if so it will be removed from the vector,
+ * indicating that we can safely skip it.
+ */
+ AbsorbFsyncRequests();
+ if (!segnum_array_binary_search(segnum_vector_begin(requests_in_progress),
+ segnum_vector_end(requests_in_progress),
+ &segno))
+ continue;
+
+ /* Otherwise it's an unexpectedly missing file. */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open backing file to fsync: %u/%u/%u",
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno)));
+ }
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ /* Remove this segment number. */
+ Assert(segno == *segnum_vector_back(requests_in_progress));
+ segnum_vector_pop_back(requests_in_progress);
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d db=%u rel=%u seg=%u time=%.3f msec",
+ processed,
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno,
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+ }
+ }
+
+ /*
+ * We've finished everything that was requested before we started to
+ * scan the entry. If no new requests have been inserted meanwhile,
+ * remove the entry. Otherwise, update its cycle counter, as all the
+ * requests now in it must have arrived during this cycle.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ Assert(segnum_vector_empty(&entry->requests_in_progress[forknum]));
+ if (!segnum_vector_empty(&entry->requests[forknum]))
+ break;
+ segnum_vector_reset(&entry->requests[forknum]);
+ }
+ if (forknum <= MAX_FORKNUM)
+ entry->cycle_ctr = sync_cycle_ctr;
+ else
+ {
+ /* Okay to remove it */
+ if (hash_search(pendingFsyncTable, &entry->rnode,
+ HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOpsTable corrupted");
+ }
+ } /* end loop over hashtable entries */
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of smgrsync */
+ sync_in_progress = false;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * Mark a file as needing fsync.
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request to
+ * the checkpointer process.
+ *
+ * Returns true on success, but false if the queue was full and we couldn't
+ * pass the request to the checkpointer, meaning that the caller must
+ * perform the fsync.
+ */
+bool
+FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ RememberFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum, segno);
+ return true;
+ }
+ else
+ return ForwardFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum,
+ segno);
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+
+ ForwardFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+/*
+ * Find and remove a segment number by binary search.
+ */
+static inline void
+delete_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ SegmentNumber *position =
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno);
+
+ if (position != segnum_vector_end(vec) &&
+ *position == segno)
+ segnum_vector_erase(vec, position);
+}
+
+/*
+ * Add a segment number, locating the position by binary search. Hopefully
+ * these tend to be added at the high end, which is cheap.
+ */
+static inline void
+insert_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ segnum_vector_insert(vec,
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno),
+ &segno);
+}
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * Valid values for 'type':
+ * - FSYNC_SEGMENT_REQUEST means to schedule an fsync
+ * - FORGET_SEGMENT_FSYNC means to cancel pending fsyncs for one segment
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers are never equal to
+ * InvalidSegmentNumber.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ Assert(pendingFsyncTable);
+
+ if (type == FORGET_SEGMENT_FSYNC || type == FORGET_RELATION_FSYNC)
+ {
+ PendingFsyncEntry *entry;
+
+ entry = hash_search(pendingFsyncTable, &rnode, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (type == FORGET_SEGMENT_FSYNC)
+ {
+ delete_segno(&entry->requests[forknum], segno);
+ delete_segno(&entry->requests_in_progress[forknum], segno);
+ }
+ else if (forknum == InvalidForkNumber)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ else
+ {
+ /* Forget about all segments for one fork. */
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+ else if (type == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+
+ /* Remove unlink requests */
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (type == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = rnode;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ &rnode,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ ForkNumber f;
+
+ entry->cycle_ctr = ckpt_cycle_ctr;
+ for (f = 0; f <= MAX_FORKNUM; f++)
+ {
+ segnum_vector_init(&entry->requests[f]);
+ segnum_vector_init(&entry->requests_in_progress[f]);
+ }
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ insert_segno(&entry->requests[forknum], segno);
+ }
+}
+
+/*
+ * ForgetSegmentFsyncRequests -- forget any fsyncs for one segment of a
+ * relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum, segno);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum,
+ segno))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ RelFileNode rnode;
+
+ rnode.dbNode = dbid;
+ rnode.spcNode = 0;
+ rnode.relNode = 0;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ while (!ForwardFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 970c94ee805..32bc91102d7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6fe19398812..db23de3a131 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..137c748dfaf 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc9692..61fe0276f74 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,15 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * The type used to identify segment numbers. Generally, segments are an
+ * internal detail of individual storage manager implementations, but since
+ * they appear in various places to allow them to be passed between processes,
+ * it seemed worthwhile to have a typename.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -105,10 +114,9 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern bool smgrimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void AtEOXact_SMgr(void);
@@ -133,16 +141,9 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern bool mdimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
#endif /* SMGR_H */
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 00000000000..8ef7093f801
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "storage/smgr.h"
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern bool FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void RememberFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, SegmentNumber segno);
+extern void SetForwardFsyncRequests(void);
+extern void ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetDatabaseFsyncRequests(Oid dbid);
+
+
+#endif /* SMGRSYNC_H */
--
2.19.1
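To make the intended call pattern concrete, here is a minimal sketch, not
part of the patch, of how a storage manager implementation could register
a dirty segment after writing it, in the style of md.c's
register_dirty_segment(). The function name and error wording are invented
for illustration:

#include "postgres.h"

#include "storage/smgr.h"
#include "storage/smgrsync.h"

/*
 * Sketch only: ask the checkpointer (or the local pending-ops table, in
 * the startup process or a standalone backend) to fsync one segment at
 * the next checkpoint, falling back to doing it ourselves.
 */
static void
register_dirty_segment_sketch(SMgrRelation reln, ForkNumber forknum,
                              SegmentNumber segno)
{
    if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, segno))
    {
        /*
         * The request queue was full and couldn't be compacted, so this
         * backend must fsync the segment itself.
         */
        if (!smgrimmedsync(reln, forknum, segno))
            elog(ERROR, "could not find segment %u to fsync", segno);
    }
}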
On Tue, Jan 1, 2019 at 10:41 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
So, I have split this work into multiple patches. 0001 is a draft
version of some new infrastructure I'd like to propose, 0002 is the
thing originally described by the first two paragraphs in the first
email in this thread, and the rest I'll have to defer for now (the fd
passing stuff).
Apologies, there was a header missing from 0002, and a small change
needed to a contrib file that I missed. Here is a new version.
For the 0001 patch, I'll probably want to reconsider the naming a bit
("simple" -> "specialized", "generic", ...?), refine it (ability to turn
off the small vector optimisation? optional MemoryContext? ability
to extend without copying or zero-initialising at the same time?
comparators with a user data parameter? two-value comparators vs
three-value comparators? qsort with inline comparator? etc etc),
remove some gratuitous C++ cargo cultisms, and maybe also instantiate
the thing centrally for some common types (I mean, perhaps 0002 should
use a common uint32_vector rather than defining its own
segnum_vector?).
I suppose an alternative would be to use simplehash for the set of
segment numbers here, but it seems like overkill and would waste a ton
of memory in the common case of holding a single number. I wondered
also about some kind of tree (basically, C++ set) but it seems much
more complicated and would still be much larger for common cases.
Sorted vectors seem to work pretty well here (they would lose in
theoretical cases where you insert low values into large sets, but as
far as I can see that doesn't happen in practice here).
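In case it helps review, here is roughly how I'd expect the two
templates to be instantiated and used together. Treat it as a sketch
only; 0002's actual invocations may differ, and the demo function and
its values are invented:

/* Generate segnum_vector, a vector of SegmentNumber with the in-place
 * small-vector optimisation (no allocation for <= 3 elements). */
#define SV_PREFIX segnum_vector
#define SV_ELEMENT_TYPE SegmentNumber
#define SV_SCOPE static inline
#define SV_DECLARE
#define SV_DEFINE
#include "lib/simplevector.h"

/* Generate segnum_array_sort/unique/binary_search/lower_bound. */
#define SA_PREFIX segnum_array
#define SA_ELEMENT_TYPE SegmentNumber
#define SA_SCOPE static inline
#define SA_DECLARE
#define SA_DEFINE
#define SA_COMPARE(a, b) (*(a) < *(b) ? -1 : (*(a) > *(b) ? 1 : 0))
#include "lib/sort_utils.h"

static void
segnum_demo(void)
{
    segnum_vector vec;
    SegmentNumber s1 = 42;
    SegmentNumber s2 = 7;

    segnum_vector_init(&vec);

    /* Insert keeping the vector sorted, as 0002's insert_segno() does. */
    segnum_vector_insert(&vec,
                         segnum_array_lower_bound(segnum_vector_begin(&vec),
                                                  segnum_vector_end(&vec),
                                                  &s1),
                         &s1);
    segnum_vector_insert(&vec,
                         segnum_array_lower_bound(segnum_vector_begin(&vec),
                                                  segnum_vector_end(&vec),
                                                  &s2),
                         &s2);

    Assert(segnum_vector_size(&vec) == 2);
    Assert(*segnum_vector_begin(&vec) == 7);
    Assert(segnum_array_binary_search(segnum_vector_begin(&vec),
                                      segnum_vector_end(&vec), &s1));

    segnum_vector_destroy(&vec);
}

The in-place capacity is what makes the common case cheap: a relation
with only a few dirty segments never allocates at all.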
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Add-parameterized-vectors-and-sorting-searching-s-v5.patchapplication/octet-stream; name=0001-Add-parameterized-vectors-and-sorting-searching-s-v5.patchDownload
From 84b01f0feb236e0854f0a04e2651f87ec445b3fe Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Tue, 1 Jan 2019 07:05:46 +1300
Subject: [PATCH 1/2] Add parameterized vectors and sorting/searching support.
To make it a bit easier to work with arrays (rather than lists or
bitmaps), create a mechanism along the lines of StringInfo, but usable
with other types (eg ints, structs, ...) that can be parameterized at
compile time. Follow the example of simplehash.h.
Provide some simple sorting and searching algorithms for working with
sorted arrays and vectors.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/include/lib/simplevector.h | 447 +++++++++++++++++++++++++++++++++
src/include/lib/sort_utils.h | 174 +++++++++++++
2 files changed, 621 insertions(+)
create mode 100644 src/include/lib/simplevector.h
create mode 100644 src/include/lib/sort_utils.h
diff --git a/src/include/lib/simplevector.h b/src/include/lib/simplevector.h
new file mode 100644
index 00000000000..e416af9cf6b
--- /dev/null
+++ b/src/include/lib/simplevector.h
@@ -0,0 +1,447 @@
+/*-------------------------------------------------------------------------
+ *
+ * simplevector.h
+ *
+ * Vector implementation that will be specialized for user-defined types,
+ * by including this file to generate the required code. Suitable for
+ * value types that can be bitwise copied and moved. Includes an in-place
+ * small-vector optimization, so that allocation can be avoided until the
+ * internal space is exceeded.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ * To generate a type and associated functions, the following parameter
+ * macros should be #define'd before this file is included.
+ *
+ * - SV_PREFIX - prefix for all symbol names generated.
+ * - SV_ELEMENT_TYPE - type of the contained elements
+ * - SV_DECLARE - if defined the functions and types are declared
+ * - SV_DEFINE - if defined the functions and types are defined
+ * - SV_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ * IDENTIFICATION
+ * src/include/lib/simplevector.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* helpers */
+#define SV_MAKE_PREFIX(a) CppConcat(a,_)
+#define SV_MAKE_NAME(name) SV_MAKE_NAME_(SV_MAKE_PREFIX(SV_PREFIX),name)
+#define SV_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define SV_TYPE SV_PREFIX
+
+/* function declarations */
+#define SV_INIT SV_MAKE_NAME(init)
+#define SV_DESTROY SV_MAKE_NAME(destroy)
+#define SV_RESET SV_MAKE_NAME(reset)
+#define SV_CLEAR SV_MAKE_NAME(clear)
+#define SV_DATA SV_MAKE_NAME(data)
+#define SV_EMPTY SV_MAKE_NAME(empty)
+#define SV_SIZE SV_MAKE_NAME(size)
+#define SV_RESIZE SV_MAKE_NAME(resize)
+#define SV_CAPACITY SV_MAKE_NAME(capacity)
+#define SV_RESERVE SV_MAKE_NAME(reserve)
+#define SV_APPEND SV_MAKE_NAME(append)
+#define SV_APPEND_N SV_MAKE_NAME(append_n)
+#define SV_INSERT SV_MAKE_NAME(insert)
+#define SV_INSERT_N SV_MAKE_NAME(insert_n)
+#define SV_ERASE SV_MAKE_NAME(erase)
+#define SV_ERASE_N SV_MAKE_NAME(erase_n)
+#define SV_BEGIN SV_MAKE_NAME(begin)
+#define SV_END SV_MAKE_NAME(end)
+#define SV_BACK SV_MAKE_NAME(back)
+#define SV_POP_BACK SV_MAKE_NAME(pop_back)
+#define SV_SWAP SV_MAKE_NAME(swap)
+
+#ifndef SV_IN_PLACE_CAPACITY
+#define SV_IN_PLACE_CAPACITY 3
+#endif
+
+#ifdef SV_DECLARE
+
+typedef struct SV_TYPE
+{
+ /*
+ * If size is <= SV_IN_PLACE_CAPACITY, then it represents the number of
+ * elements stored in u.elements. Otherwise, it is the capacity of the
+ * buffer in u.overflow.data (in number of potential elements), and
+ * u.overflow.count represents the number of occupied elements.
+ */
+ uint32 size;
+ union
+ {
+ struct
+ {
+ void *data;
+ uint32 count;
+ } overflow;
+ SV_ELEMENT_TYPE elements[SV_IN_PLACE_CAPACITY];
+ } u;
+} SV_TYPE;
+
+/* externally visible function prototypes */
+SV_SCOPE void SV_INIT(SV_TYPE *vec);
+SV_SCOPE void SV_DESTROY(SV_TYPE *vec);
+SV_SCOPE void SV_RESET(SV_TYPE *vec);
+SV_SCOPE void SV_CLEAR(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_DATA(SV_TYPE *vec);
+SV_SCOPE bool SV_EMPTY(SV_TYPE *vec);
+SV_SCOPE uint32 SV_SIZE(SV_TYPE *vec);
+SV_SCOPE void SV_RESIZE(SV_TYPE *vec, uint32 size);
+SV_SCOPE uint32 SV_CAPACITY(SV_TYPE *vec);
+SV_SCOPE void SV_RESERVE(SV_TYPE *vec, uint32 capacity);
+SV_SCOPE void SV_APPEND(SV_TYPE *vec, const SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_APPEND_N(SV_TYPE *vec, const SV_ELEMENT_TYPE *values,
+ uint32 size);
+SV_SCOPE void SV_INSERT(SV_TYPE *vec,
+ SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_INSERT_N(SV_TYPE *vec,
+ SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *values,
+ uint32 n);
+SV_SCOPE void SV_ERASE(SV_TYPE *vec, SV_ELEMENT_TYPE *position);
+SV_SCOPE void SV_ERASE_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position, uint32 n);
+SV_SCOPE void SV_SWAP(SV_TYPE *a, SV_TYPE *b);
+SV_SCOPE SV_ELEMENT_TYPE *SV_BEGIN(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_END(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_BACK(SV_TYPE *vec);
+SV_SCOPE void SV_POP_BACK(SV_TYPE *vec);
+
+#endif
+
+#ifdef SV_DEFINE
+
+/*
+ * Initialize a vector in-place.
+ */
+SV_SCOPE void
+SV_INIT(SV_TYPE *vec)
+{
+ vec->size = 0;
+}
+
+/*
+ * Free any resources owned by the vector.
+ */
+SV_SCOPE void
+SV_DESTROY(SV_TYPE *vec)
+{
+ SV_RESET(vec);
+}
+
+/*
+ * Free any resources owned by the vector and reset it to the empty state.
+ */
+SV_SCOPE void
+SV_RESET(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ pfree(vec->u.overflow.data);
+ vec->size = 0;
+}
+
+/*
+ * Clear the vector so that it contains no elements.
+ */
+SV_SCOPE void
+SV_CLEAR(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count = 0;
+ else
+ vec->size = 0;
+}
+
+/*
+ * Return a pointer to the elements in the vector.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_DATA(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->u.overflow.data;
+ else
+ return &vec->u.elements[0];
+}
+
+/*
+ * Check if the vector is empty (has no elements).
+ */
+SV_SCOPE bool
+SV_EMPTY(SV_TYPE *vec)
+{
+ return SV_SIZE(vec) == 0;
+}
+
+/*
+ * Return the number of elements in the vector.
+ */
+SV_SCOPE uint32
+SV_SIZE(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->u.overflow.count;
+ else
+ return vec->size;
+}
+
+/*
+ * Resize the vector, discarding elements at the end, or creating new
+ * zero-initialized elements as required.
+ */
+SV_SCOPE void
+SV_RESIZE(SV_TYPE *vec, uint32 size)
+{
+ uint32 old_size = SV_SIZE(vec);
+
+ /* Growing? */
+ if (size > old_size)
+ {
+ SV_RESERVE(vec, size);
+ memset(&SV_DATA(vec)[old_size], 0,
+ sizeof(SV_ELEMENT_TYPE) * (size - old_size));
+ }
+
+ /* Set the new size. */
+ if (vec->size <= SV_IN_PLACE_CAPACITY)
+ vec->size = size;
+ else
+ vec->u.overflow.count = size;
+}
+
+/*
+ * Return the number of elements that can be held in the vector before it
+ * needs to reallocate.
+ */
+SV_SCOPE uint32
+SV_CAPACITY(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->size;
+ else
+ return SV_IN_PLACE_CAPACITY;
+}
+
+/*
+ * Make sure we have capacity for a given number of elements without having to
+ * reallocate.
+ */
+SV_SCOPE void
+SV_RESERVE(SV_TYPE *vec, uint32 capacity)
+{
+ void *new_buffer;
+
+ /* Do nothing if we already have that much capacity. */
+ if (capacity <= SV_IN_PLACE_CAPACITY || capacity < vec->size)
+ return;
+
+ /* Allocate larger buffer. */
+#ifdef SV_GLOBAL_MEMORY_CONTEXT
+ new_buffer = MemoryContextAlloc(SV_GLOBAL_MEMORY_CONTEXT,
+ sizeof(SV_ELEMENT_TYPE) * capacity);
+#else
+ new_buffer = palloc(sizeof(SV_ELEMENT_TYPE) * capacity);
+#endif
+
+ /* Copy existing data to new buffer. */
+ if (vec->size <= SV_IN_PLACE_CAPACITY)
+ {
+ /* Promote from in-line format. */
+ if (vec->size > 0)
+ memcpy(new_buffer,
+ vec->u.elements,
+ sizeof(SV_ELEMENT_TYPE) * vec->size);
+ vec->u.overflow.count = vec->size;
+ }
+ else
+ {
+ /* Copy from existing smaller overflow buffer, and free it. */
+ if (vec->u.overflow.count > 0)
+ memcpy(new_buffer,
+ vec->u.overflow.data,
+ sizeof(SV_ELEMENT_TYPE) * vec->u.overflow.count);
+ Assert(vec->u.overflow.data);
+ pfree(vec->u.overflow.data);
+ }
+ vec->u.overflow.data = new_buffer;
+ vec->size = capacity;
+}
+
+/*
+ * Append a value to the end of a vector.
+ */
+SV_SCOPE void
+SV_APPEND(SV_TYPE *vec, const SV_ELEMENT_TYPE *value)
+{
+ SV_APPEND_N(vec, value, 1);
+}
+
+/*
+ * Append N values to the end of a vector.
+ */
+SV_SCOPE void
+SV_APPEND_N(SV_TYPE *vec, const SV_ELEMENT_TYPE *values, uint32 n)
+{
+ uint32 size = SV_SIZE(vec);
+
+ SV_RESERVE(vec, size + n);
+ memcpy(&SV_DATA(vec)[size], values, sizeof(SV_ELEMENT_TYPE) * n);
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count += n;
+ else
+ vec->size += n;
+}
+
+/*
+ * Insert a value before an arbitrary position in the vector. This is not
+ * especially efficient as it must shift values to make space.
+ */
+SV_SCOPE void
+SV_INSERT(SV_TYPE *vec, SV_ELEMENT_TYPE *position, const SV_ELEMENT_TYPE *value)
+{
+ SV_INSERT_N(vec, position, value, 1);
+}
+
+/*
+ * Insert N values before an arbitrary position in the vector. This is not
+ * especially efficient as it must shift values to make space.
+ */
+SV_SCOPE void
+SV_INSERT_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *values, uint32 n)
+{
+ uint32 size = SV_SIZE(vec);
+ uint32 i = position - SV_DATA(vec);
+ SV_ELEMENT_TYPE *data;
+
+ if (n == 0)
+ return;
+
+ Assert(position >= SV_DATA(vec) &&
+ position <= SV_DATA(vec) + size);
+ SV_RESERVE(vec, size + n);
+ data = SV_DATA(vec);
+ memmove(&data[i + n],
+ &data[i],
+ sizeof(SV_ELEMENT_TYPE) * (size - i));
+ memcpy(&data[i], values, sizeof(SV_ELEMENT_TYPE) * n);
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count += n;
+ else
+ vec->size += n;
+}
+
+/*
+ * Erase an arbitrary element in the vector. This is not especially
+ * efficient as it must shift trailing values.
+ */
+SV_SCOPE void
+SV_ERASE(SV_TYPE *vec, SV_ELEMENT_TYPE *position)
+{
+ SV_ERASE_N(vec, position, 1);
+}
+
+/*
+ * Erase N values beginning with an arbitrary element in the vector. This is
+ * not especially efficient as it must shift trailing values.
+ */
+SV_SCOPE void
+SV_ERASE_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position, uint32 n)
+{
+ Assert(position >= SV_DATA(vec) &&
+ position + n <= SV_DATA(vec) + SV_SIZE(vec));
+ memmove(position,
+ position + n,
+ sizeof(SV_ELEMENT_TYPE) * (SV_DATA(vec) + SV_SIZE(vec) - (position + n)));
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count -= n;
+ else
+ vec->size -= n;
+}
+
+/*
+ * Get a pointer to the first element, if there is one.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_BEGIN(SV_TYPE *vec)
+{
+ return SV_DATA(vec);
+}
+
+/*
+ * Get a pointer to the element past the last element.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_END(SV_TYPE *vec)
+{
+ return SV_DATA(vec) + SV_SIZE(vec);
+}
+
+/*
+ * Get a pointer to the back (last) element.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_BACK(SV_TYPE *vec)
+{
+ Assert(!SV_EMPTY(vec));
+ return SV_DATA(vec) + SV_SIZE(vec) - 1;
+}
+
+/*
+ * Remove the back (last) element.
+ */
+SV_SCOPE void
+SV_POP_BACK(SV_TYPE *vec)
+{
+ Assert(!SV_EMPTY(vec));
+ SV_RESIZE(vec, SV_SIZE(vec) - 1);
+}
+
+/*
+ * Swap the contents of two vectors.
+ */
+SV_SCOPE void
+SV_SWAP(SV_TYPE *a, SV_TYPE *b)
+{
+ SV_TYPE tmp;
+
+ tmp = *a;
+ *a = *b;
+ *b = tmp;
+}
+
+#endif
+
+#undef SV_MAKE_PREFIX
+#undef SV_MAKE_NAME
+#undef SV_MAKE_NAME_
+#undef SV_INIT
+#undef SV_DESTROY
+#undef SV_RESET
+#undef SV_CLEAR
+#undef SV_DATA
+#undef SV_EMPTY
+#undef SV_SIZE
+#undef SV_RESIZE
+#undef SV_CAPACITY
+#undef SV_RESERVE
+#undef SV_APPEND
+#undef SV_APPEND_N
+#undef SV_INSERT
+#undef SV_INSERT_N
+#undef SV_ERASE
+#undef SV_ERASE_N
+#undef SV_BEGIN
+#undef SV_END
+#undef SV_BACK
+#undef SV_POP_BACK
+#undef SV_SWAP
+#undef SV_IN_PLACE_CAPACITY
+#undef SV_DECLARE
+#undef SV_DEFINE
diff --git a/src/include/lib/sort_utils.h b/src/include/lib/sort_utils.h
new file mode 100644
index 00000000000..6f5466f5fd5
--- /dev/null
+++ b/src/include/lib/sort_utils.h
@@ -0,0 +1,174 @@
+/*-------------------------------------------------------------------------
+ *
+ * sort_utils.h
+ *
+ * Simple sorting-related algorithms specialized for arrays of
+ * parameterized type, using inlined comparators.
+ *
+ * Copyright (c) 2018, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ * To generate functions specialized for a type, the following parameter
+ * macros should be #define'd before this file is included.
+ *
+ * - SA_PREFIX - prefix for all symbol names generated.
+ * - SA_ELEMENT_TYPE - type of the referenced elements
+ * - SA_DECLARE - if defined the functions and types are declared
+ * - SA_DEFINE - if defined the functions and types are defined
+ * - SA_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ * The following are relevant only when SA_DEFINE is defined:
+ *
+ * - SA_COMPARE(a, b) - an expression to compare pointers to two values
+ *
+ * IDENTIFICATION
+ * src/include/lib/sort_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define SA_MAKE_PREFIX(a) CppConcat(a,_)
+#define SA_MAKE_NAME(name) SA_MAKE_NAME_(SA_MAKE_PREFIX(SA_PREFIX),name)
+#define SA_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define SA_SORT SA_MAKE_NAME(sort)
+#define SA_UNIQUE SA_MAKE_NAME(unique)
+#define SA_BINARY_SEARCH SA_MAKE_NAME(binary_search)
+#define SA_LOWER_BOUND SA_MAKE_NAME(lower_bound)
+
+#ifdef SA_DECLARE
+
+SA_SCOPE void SA_SORT(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last);
+SA_SCOPE SA_ELEMENT_TYPE *SA_UNIQUE(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last);
+SA_SCOPE bool SA_BINARY_SEARCH(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value);
+SA_SCOPE SA_ELEMENT_TYPE *SA_LOWER_BOUND(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value);
+
+#endif
+
+#ifdef SA_DEFINE
+
+/* helper functions */
+#define SA_QSORT_COMPARATOR SA_MAKE_NAME(qsort_comparator)
+
+/*
+ * Function wrapper for comparator expression.
+ */
+static inline int
+SA_QSORT_COMPARATOR(const void *a, const void *b)
+{
+ return SA_COMPARE((SA_ELEMENT_TYPE *) a, (SA_ELEMENT_TYPE *) b);
+}
+
+/*
+ * Sort an array [first, last) in place. For now, just calls out to qsort,
+ * but a quicksort with inlined comparators is known to be faster so we could
+ * consider that here in future.
+ */
+SA_SCOPE void
+SA_SORT(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last)
+{
+ qsort(first, last - first, sizeof(SA_ELEMENT_TYPE), SA_QSORT_COMPARATOR);
+}
+
+/*
+ * Remove duplicates from an array [first, last). Return the new last pointer
+ * (ie one past the new end).
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_UNIQUE(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last)
+{
+ SA_ELEMENT_TYPE *write_head;
+ SA_ELEMENT_TYPE *read_head;
+
+ if (last - first <= 1)
+ return last;
+
+ write_head = first;
+ read_head = first + 1;
+
+ while (read_head < last)
+ {
+ if (SA_COMPARE(read_head, write_head) != 0)
+ *++write_head = *read_head;
+ ++read_head;
+ }
+ return write_head + 1;
+}
+
+/*
+ * Check if a sorted array [first, last) contains a value.
+ */
+SA_SCOPE bool
+SA_BINARY_SEARCH(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value)
+{
+ SA_ELEMENT_TYPE *lower = first;
+ SA_ELEMENT_TYPE *upper = last - 1;
+
+ while (lower <= upper)
+ {
+ SA_ELEMENT_TYPE *mid;
+ int cmp;
+
+ mid = lower + (upper - lower) / 2;
+ cmp = SA_COMPARE(mid, value);
+ if (cmp < 0)
+ lower = mid + 1;
+ else if (cmp > 0)
+ upper = mid - 1;
+ else
+ return true;
+ }
+
+ return false;
+}
+
+/*
+ * Find the first element in the range [first, last) that is not less than
+ * value, in a sorted array. If an exact match is found, a pointer to a
+ * matching element is returned; otherwise the result is the position where
+ * value would be inserted to keep the array sorted (possibly last).
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_LOWER_BOUND(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value)
+{
+ SA_ELEMENT_TYPE *lower = first;
+ SA_ELEMENT_TYPE *upper = last - 1;
+
+ while (lower <= upper)
+ {
+ SA_ELEMENT_TYPE *mid;
+ int cmp;
+
+ mid = lower + (upper - lower) / 2;
+ cmp = SA_COMPARE(mid, value);
+ if (cmp < 0)
+ lower = mid + 1;
+ else if (cmp > 0)
+ upper = mid - 1;
+ else
+ return mid;
+ }
+
+ /* Not found: lower is the first element greater than value. */
+ return lower;
+}
+
+#endif
+
+#undef SA_MAKE_PREFIX
+#undef SA_MAKE_NAME
+#undef SA_MAKE_NAME_
+#undef SA_SORT
+#undef SA_UNIQUE
+#undef SA_BINARY_SEARCH
+#undef SA_LOWER_BOUND
+#undef SA_DECLARE
+#undef SA_DEFINE
--
2.19.1
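For anyone who wants to kick the tyres of 0001 by itself, here is a
self-contained instantiation over plain uint32, in the direction of the
common uint32_vector idea mentioned above. All names are invented for
the demo; lengthof() comes from c.h:

#include "postgres.h"

#define SA_PREFIX uint32_array
#define SA_ELEMENT_TYPE uint32
#define SA_SCOPE static inline
#define SA_DECLARE
#define SA_DEFINE
#define SA_COMPARE(a, b) (*(a) < *(b) ? -1 : (*(a) > *(b) ? 1 : 0))
#include "lib/sort_utils.h"

static void
sort_utils_demo(void)
{
    uint32      values[] = {5, 1, 5, 3};
    uint32     *end = values + lengthof(values);
    uint32      key = 3;

    uint32_array_sort(values, end);         /* values is now 1 3 5 5 */
    end = uint32_array_unique(values, end); /* logically 1 3 5 */
    Assert(uint32_array_binary_search(values, end, &key));
    Assert(*uint32_array_lower_bound(values, end, &key) == 3);
}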
0002-Refactor-the-fsync-machinery-to-support-future-SM-v5.patchapplication/octet-stream; name=0002-Refactor-the-fsync-machinery-to-support-future-SM-v5.patchDownload
From 1c6a82a9b0506eb303f3582704b2255e3ffc6abf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 31 Dec 2018 15:25:16 +1300
Subject: [PATCH 2/2] Refactor the fsync machinery to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting fsyncs out of md.c into a new
translation unit smgrsync.c.
2. Have smgrsync.c perform the actual fsync() calls via the existing
polymorphic smgrimmedsync() interface, extended to allow an individual
segment number to be specified.
3. Teach the checkpointer how to forget individual segments that are
unlinked from the 'front' after having been dropped from shared
buffers.
4. Move the request tracking from a bitmapset into a sorted vector,
because the proposed block storage managers are not anchored at zero
and use potentially very large and sparse integers.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/xlog.c | 2 +
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/catalog/heap.c | 2 +-
src/backend/commands/dbcommands.c | 3 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 22 +-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/ipc/ipci.c | 1 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 801 ++-----------------------
src/backend/storage/smgr/smgr.c | 104 ++--
src/backend/storage/smgr/smgrsync.c | 834 ++++++++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 42 ++
src/include/storage/smgr.h | 29 +-
src/include/storage/smgrsync.h | 36 ++
25 files changed, 1037 insertions(+), 888 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index 9f223d3b2a7..4a9ef0c8be4 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -188,7 +188,7 @@ blbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 96501456422..f3d53bb47dd 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9361,7 +9361,7 @@ heap_sync(Relation rel)
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
/* FSM is not critical, don't bother syncing it */
@@ -9372,7 +9372,7 @@ heap_sync(Relation rel)
toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
heap_close(toastrel, AccessShareLock);
}
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index e8725fbbe1e..a0f957d1ef4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -178,7 +178,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index 16f57557776..a829c9cc034 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1207,7 +1207,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (RelationNeedsWAL(wstate->index))
{
RelationOpenSmgr(wstate->index);
- smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
}
}
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index 7dd0d61fbbc..7201b6533f3 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -205,7 +205,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 26b4977acbe..806ca1c8504 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index fc1927c537b..f04cb86d650 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 4d5b82aaa95..7927b353fcf 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1405,7 +1405,7 @@ heap_create_init_fork(Relation rel)
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
- smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index f640f469729..854d5fd2e9b 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
@@ -55,6 +55,7 @@
#include "storage/ipc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index c8c50e8c989..a5f19eaf3f0 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11300,7 +11300,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
* here, they might still not be on disk when the crash occurs.
*/
if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ smgrimmedsync(dst, forkNum, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4a714f6e2be..aa76b8d25ec 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index 7612b17b442..b37a25fc2a6 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b9c118e1560..be93f745ca5 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -47,6 +47,8 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -56,6 +58,7 @@
#include "storage/proc.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -108,10 +111,10 @@
*/
typedef struct
{
- RelFileNode rnode;
+ int type;
+ RelFileNode rnode;
ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ SegmentNumber segno;
} CheckpointerRequest;
typedef struct
@@ -1077,9 +1080,7 @@ RequestCheckpoint(int flags)
* RelFileNodeBackend.
*
* segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
+ * fsync'd.
*
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
@@ -1092,13 +1093,14 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
CheckpointerRequest *request;
bool too_full;
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
@@ -1130,6 +1132,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ request->type = type;
request->rnode = rnode;
request->forknum = forknum;
request->segno = segno;
@@ -1314,7 +1317,8 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberFsyncRequest(request->type, request->rnode, request->forknum,
+ request->segno);
END_CRIT_SECTION();
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9817770affc..52c4801ddf4 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 473513a9272..53e846ea918 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..c9c4be325ed 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4c6a50509f8..114963ff42a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -134,30 +121,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +150,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +173,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -388,7 +295,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -448,7 +355,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -993,423 +900,55 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
*
* Note that only writes already issued are synced; this routine knows
* nothing of dirty buffers that may exist inside the buffer manager.
+ *
+ * See smgrimmedsync comment for contract.
*/
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+bool
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int segno;
+ MdfdVec *segments;
+ size_t num_segments;
+ size_t i;
- /*
- * NOTE: mdnblocks makes sure we have opened all active segments, so that
- * fsync loop will get them all!
- */
- mdnblocks(reln, forknum);
-
- segno = reln->md_num_open_segs[forknum];
+ if (segno != InvalidSegmentNumber)
+ {
+ /*
+ * Get the specified segment, or report failure if it doesn't seem to
+ * exist.
+ */
+ segments = _mdfd_openseg(reln, forknum, segno * RELSEG_SIZE,
+ EXTENSION_RETURN_NULL);
+ if (segments == NULL)
+ return false;
+ num_segments = 1;
+ }
+ else
+ {
+ /*
+ * NOTE: mdnblocks makes sure we have opened all active segments, so that
+ * fsync loop will get them all!
+ */
+ mdnblocks(reln, forknum);
+ num_segments = reln->md_num_open_segs[forknum];
+ segments = &reln->md_seg_fds[forknum][0];
+ }
- while (segno > 0)
+ for (i = 0; i < num_segments; ++i)
{
- MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ MdfdVec *v = &segments[i];
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
FilePathName(v->mdfd_vfd))));
- segno--;
- }
-}
-
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
}
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return true;
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
@@ -1417,16 +956,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1438,258 +969,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342ef86a..42596f14e9e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -58,10 +59,8 @@ typedef struct f_smgr
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
- void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ bool (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
} f_smgr;
@@ -81,10 +80,7 @@ static const f_smgr smgrsw[] = {
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
- .smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_immedsync = mdimmedsync
}
};
@@ -104,6 +100,14 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation.
+ */
+static inline int
+which_for_relfilenode(RelFileNode rnode)
+{
+ return 0; /* we only have md.c at present */
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +122,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -185,7 +191,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
reln->smgr_vm_nblocks = InvalidBlockNumber;
- reln->smgr_which = 0; /* we only have md.c at present */
+ reln->smgr_which = which_for_relfilenode(rnode);
/* mark it not open */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
@@ -726,17 +732,20 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* smgrimmedsync() -- Force the specified relation to stable storage.
*
* Synchronously force all previous writes to the specified relation
- * down to disk.
- *
- * This is useful for building completely new relations (eg, new
- * indexes). Instead of incrementally WAL-logging the index build
- * steps, we can just write completed index pages to disk with smgrwrite
- * or smgrextend, and then fsync the completed index file before
- * committing the transaction. (This is sufficient for purposes of
- * crash recovery, since it effectively duplicates forcing a checkpoint
- * for the completed index. But it is *not* sufficient if one wishes
- * to use the WAL log for PITR or replication purposes: in that case
- * we have to make WAL entries as well.)
+ * down to disk. If segno is not InvalidSegmentNumber, this applies
+ * only to data in that one segment file.
+ *
+ * Used for checkpointing dirty files.
+ *
+ * This can also be used for building completely new relations (eg, new
+ * indexes). Instead of incrementally WAL-logging the index build steps,
+ * we can just write completed index pages to disk with smgrwrite or
+ * smgrextend, and then fsync the completed index file before committing
+ * the transaction. (This is sufficient for purposes of crash recovery,
+ * since it effectively duplicates forcing a checkpoint for the completed
+ * index. But it is *not* sufficient if one wishes to use the WAL log
+ * for PITR or replication purposes: in that case we have to make WAL
+ * entries as well.)
*
* The preceding writes should specify skipFsync = true to avoid
* duplicative fsyncs.
@@ -744,57 +753,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* Note that you need to do FlushRelationBuffers() first if there is
* any possibility that there are dirty buffers for the relation;
* otherwise the sync is not very meaningful.
+ *
+ * Failure to fsync raises an error, but non-existence of a requested
+ * segment is reported with a false return value.
*/
-void
-smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
-{
- smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
-}
-
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
+bool
+smgrimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ return smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum, segno);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 00000000000..914df0479d3
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,834 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone
+ * backends and in the startup process.
+ *
+ * Originally this logic lived inside md.c, but it has been generalized
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+#define SV_PREFIX segnum_vector
+#define SV_DECLARE
+#define SV_DEFINE
+#define SV_ELEMENT_TYPE BlockNumber
+#define SV_SCOPE static inline
+#define SV_GLOBAL_MEMORY_CONTEXT pendingOpsCxt
+#include "lib/simplevector.h"
+
+#define SA_PREFIX segnum_array
+#define SA_COMPARE(a,b) (*a < *b ? -1 : *a == *b ? 0 : 1)
+#define SA_DECLARE
+#define SA_DEFINE
+#define SA_ELEMENT_TYPE SV_ELEMENT_TYPE
+#define SA_SCOPE static inline
+#include "lib/sort_utils.h"
+
+/*
+ * In some contexts (currently, standalone backends, the startup process and
+ * the checkpointer) we keep track of pending fsync operations: we need to
+ * remember all relation segments written since the last checkpoint, so we can
+ * fsync them down to disk before completing the next checkpoint. A hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+
+typedef uint32 CycleCtr; /* can be any convenient integer size */
+
+/*
+ * Values for the "type" member of CheckpointerRequest.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FSYNC_SEGMENT_REQUEST 1
+#define FORGET_SEGMENT_FSYNC 2
+#define FORGET_RELATION_FSYNC 3
+#define FORGET_DATABASE_FSYNC 4
+#define UNLINK_RELATION_REQUEST 5
+#define UNLINK_SEGMENT_REQUEST 6
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ RelFileNode rnode;
+ segnum_vector requests[MAX_FORKNUM + 1];
+ segnum_vector requests_in_progress[MAX_FORKNUM + 1];
+ CycleCtr cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static bool sync_in_progress = false;
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbFsyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use GetCheckpointSyncCycle() to tell old entries apart
+ * from new ones: new ones will have cycle_ctr equal to
+ * IncCheckpointSyncCycle().
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ ForkNumber forknum;
+
+ entry->cycle_ctr = sync_cycle_ctr;
+
+ /*
+ * If any requests remain unprocessed, they need to be merged with
+ * the segment numbers that have arrived since.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests = &entry->requests[forknum];
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ if (!segnum_vector_empty(requests_in_progress))
+ {
+ /* Append the requests that had not yet been handled. */
+ segnum_vector_append_n(requests,
+ segnum_vector_data(requests_in_progress),
+ segnum_vector_size(requests_in_progress));
+ segnum_vector_reset(requests_in_progress);
+
+ /* Sort and make unique. */
+ segnum_array_sort(segnum_vector_begin(requests),
+ segnum_vector_end(requests));
+ segnum_vector_resize(requests,
+ segnum_array_unique(segnum_vector_begin(requests),
+ segnum_vector_end(requests)) -
+ segnum_vector_begin(requests));
+ }
+ }
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ ForkNumber forknum;
+ SMgrRelation reln;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync requests, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * Scan over the forks and segments represented by the entry.
+ *
+ * The vector manipulations are slightly tricky, because we can call
+ * AbsorbFsyncRequests() inside the loop and that could result in new
+ * segment numbers being added. So we swap the contents of "requests"
+ * with "requests_in_progress", and if we fail we'll merge it with any
+ * new requests that have arrived in the meantime.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ /*
+ * Transfer the current set of segment numbers into the "in
+ * progress" vector (which must be empty initially).
+ */
+ Assert(segnum_vector_empty(requests_in_progress));
+ segnum_vector_swap(&entry->requests[forknum], requests_in_progress);
+
+ /* Loop until all requests have been handled. */
+ while (!segnum_vector_empty(requests_in_progress))
+ {
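+ /*
+ * Work from the back of the sorted vector (the highest segment
+ * number), so that the success path below can remove each entry
+ * with a cheap pop_back.
+ */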
+ SegmentNumber segno = *segnum_vector_back(requests_in_progress);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ reln = smgropen(entry->rnode, InvalidBackendId);
+ if (!smgrimmedsync(reln, forknum, segno))
+ {
+ /*
+ * The underlying file couldn't be found. Check if a
+ * later message in the queue reports that it has been
+ * unlinked; if so it will be removed from the vector,
+ * indicating that we can safely skip it.
+ */
+ AbsorbFsyncRequests();
+ if (!segnum_array_binary_search(segnum_vector_begin(requests_in_progress),
+ segnum_vector_end(requests_in_progress),
+ &segno))
+ continue;
+
+ /* Otherwise it's an unexpectedly missing file. */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open backing file to fsync: %u/%u/%u",
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno)));
+ }
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ /* Remove this segment number. */
+ Assert(segno == *segnum_vector_back(requests_in_progress));
+ segnum_vector_pop_back(requests_in_progress);
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d db=%u rel=%u seg=%u time=%.3f msec",
+ processed,
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno,
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+ }
+ }
+
+ /*
+ * We've finished everything that was requested before we started to
+ * scan the entry. If no new requests have been inserted meanwhile,
+ * remove the entry. Otherwise, update its cycle counter, as all the
+ * requests now in it must have arrived during this cycle.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ Assert(segnum_vector_empty(&entry->requests_in_progress[forknum]));
+ if (!segnum_vector_empty(&entry->requests[forknum]))
+ break;
+ segnum_vector_reset(&entry->requests[forknum]);
+ }
+ if (forknum <= MAX_FORKNUM)
+ entry->cycle_ctr = sync_cycle_ctr;
+ else
+ {
+ /* Okay to remove it */
+ if (hash_search(pendingFsyncTable, &entry->rnode,
+ HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOpsTable corrupted");
+ }
+ } /* end loop over hashtable entries */
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of smgrsync */
+ sync_in_progress = false;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * Mark a file as needing fsync.
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request to
+ * the checkpointer process.
+ *
+ * Returns true on success, but false if the queue was full and we couldn't
+ * pass the request to the checkpointer, meaning that the caller must
+ * perform the fsync.
+ */
+bool
+FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ RememberFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum, segno);
+ return true;
+ }
+ else
+ return ForwardFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum,
+ segno);
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+
+ ForwardFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+/*
+ * Find and remove a segment number by binary search.
+ */
+static inline void
+delete_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ SegmentNumber *position =
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno);
+
+ if (position != segnum_vector_end(vec) &&
+ *position == segno)
+ segnum_vector_erase(vec, position);
+}
+
+/*
+ * Add a segment number, keeping the vector sorted, by binary search.
+ * Hopefully these tend to be added at the high end, which is cheap.
+ */
+static inline void
+insert_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ segnum_vector_insert(vec,
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno),
+ &segno);
+}
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * Valid values for 'type':
+ * - FSYNC_SEGMENT_REQUEST means to schedule an fsync
+ * - FORGET_SEGMENT_FSYNC means to cancel pending fsyncs for one segment
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers never collide with
+ * InvalidSegmentNumber.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ Assert(pendingFsyncTable);
+
+ if (type == FORGET_SEGMENT_FSYNC || type == FORGET_RELATION_FSYNC)
+ {
+ PendingFsyncEntry *entry;
+
+ entry = hash_search(pendingFsyncTable, &rnode, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (type == FORGET_SEGMENT_FSYNC)
+ {
+ delete_segno(&entry->requests[forknum], segno);
+ delete_segno(&entry->requests_in_progress[forknum], segno);
+ }
+ else if (forknum == InvalidForkNumber)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ else
+ {
+ /* Forget about all segments for one fork. */
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+ else if (type == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+
+ /* Remove unlink requests for the database too. */
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (type == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = rnode;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ &rnode,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ ForkNumber f;
+
+ entry->cycle_ctr = ckpt_cycle_ctr;
+ for (f = 0; f <= MAX_FORKNUM; f++)
+ {
+ segnum_vector_init(&entry->requests[f]);
+ segnum_vector_init(&entry->requests_in_progress[f]);
+ }
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ insert_segno(&entry->requests[forknum], segno);
+ }
+}
+
+/*
+ * ForgetSegmentFsyncRequests -- forget any fsyncs for one segment of a
+ * relation fork
+ *
+ * Unlike ForgetRelationFsyncRequests, a specific fork must be given;
+ * forknum == InvalidForkNumber is not supported here.
+ */
+void
+ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum, segno);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum,
+ segno))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ RelFileNode rnode;
+
+ rnode.dbNode = dbid;
+ rnode.spcNode = 0;
+ rnode.relNode = 0;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ while (!ForwardFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 970c94ee805..32bc91102d7 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6fe19398812..db23de3a131 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 941c6aba7d1..137c748dfaf 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 00000000000..a53e2fc6788
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,42 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern bool ForwardFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, BlockNumber segno);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern uint64 GetCheckpointSyncCycle(void);
+extern uint64 IncCheckpointSyncCycle(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif							/* CHECKPOINTER_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc9692..61fe0276f74 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,15 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * The type used to identify segment numbers. Generally, segments are an
+ * internal detail of individual storage manager implementations, but since
+ * they appear in various places to allow them to be passed between processes,
+ * it seemed worthwhile to have a typename.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -105,10 +114,9 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern bool smgrimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void AtEOXact_SMgr(void);
@@ -133,16 +141,9 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern bool mdimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
#endif /* SMGR_H */
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 00000000000..8ef7093f801
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "storage/smgr.h"
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern bool FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void RememberFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, SegmentNumber segno);
+extern void SetForwardFsyncRequests(void);
+extern void ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetDatabaseFsyncRequests(Oid dbid);
+
+#endif /* SMGRSYNC_H */
--
2.19.1
On Wed, Jan 2, 2019 at 11:40 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> For the 0001 patch, I'll probably want to reconsider the naming a bit
> ("simple" -> "specialized", "generic", ...?), refine it (ability to
> turn off the small vector optimisation? optional MemoryContext?
> ability to extend without copying or zero-initialising at the same
> time? comparators with a user data parameter? two-value comparators
> vs three-value comparators? qsort with inline comparator? etc etc),
> and remove some gratuitous C++ cargo cultisms, and maybe also
> instantiate the thing centrally for some common types (I mean,
> perhaps 0002 should use a common uint32_vector rather than defining
> its own segnum_vector?).
Here's a new version that fixes a couple of stupid bugs (mainly a
broken XXX_lower_bound(), which I replaced with the standard algorithm
I see in many sources).
I couldn't resist the urge to try porting pg_qsort() to this style.
It seems to be about twice as fast as the original at sorting integers
on my machine with -O2. I suppose people aren't going to be too
enthusiastic about yet another copy of qsort in the tree, but maybe
this approach (with a bit more work) could replace the Perl code-gen
for tuple sorting. Then the net number of copies wouldn't go up, but
this could be used for more things too, and it fits with the style of
simplehash.h and simplevector.h. Thoughts?
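To make that concrete, here's roughly what an instantiation and call
site look like (illustrative names only, not something the patches
install anywhere):

    #define SA_PREFIX int_array
    #define SA_ELEMENT_TYPE int
    #define SA_SCOPE static inline
    #define SA_DECLARE
    #define SA_DEFINE
    #define SA_COMPARE(a, b) (*(a) < *(b) ? -1 : (*(a) > *(b) ? 1 : 0))
    #include "lib/sort_utils.h"

    ...
    int         values[] = {3, 1, 2};

    int_array_sort(values, values + lengthof(values));

Because SA_COMPARE is expanded inline into the generated
int_array_sort(), there is no function pointer call per comparison,
which is presumably where most of the speedup over the
function-pointer-based pg_qsort() comes from.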
--
Thomas Munro
http://www.enterprisedb.com
Attachments:
0001-Add-parameterized-vectors-and-sorting-searching-s-v6.patch (application/octet-stream)
From 0ceb7e60c3bedd78b9c1861c8679645f78d81889 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Tue, 1 Jan 2019 07:05:46 +1300
Subject: [PATCH 1/2] Add parameterized vectors and sorting/searching support.
To make it a bit easier to work with arrays (rather than lists or
bitmaps), create a mechanism along the lines of StringInfo, but usable
with other types (eg ints, structs, ...) that can be parameterized at
compile time. Follow the example of simplehash.h.
Provide some simple sorting and searching algorithms for working with
sorted arrays and vectors.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/include/lib/simplevector.h | 447 +++++++++++++++++++++++++++++++++
src/include/lib/sort_utils.h | 308 +++++++++++++++++++++++
2 files changed, 755 insertions(+)
create mode 100644 src/include/lib/simplevector.h
create mode 100644 src/include/lib/sort_utils.h
diff --git a/src/include/lib/simplevector.h b/src/include/lib/simplevector.h
new file mode 100644
index 00000000000..df0fffeaa94
--- /dev/null
+++ b/src/include/lib/simplevector.h
@@ -0,0 +1,447 @@
+/*-------------------------------------------------------------------------
+ *
+ * simplevector.h
+ *
+ * Vector implementation that will be specialized for user-defined types,
+ * by including this file to generate the required code. Suitable for
+ * value types that can be bitwise copied and moved. Includes an in-place
+ * small-vector optimization, so that allocation can be avoided until the
+ * internal space is exceeded.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ * To generate a type and associated functions, the following parameter
+ * macros should be #define'd before this file is included.
+ *
+ * - SV_PREFIX - prefix for all symbol names generated.
+ * - SV_ELEMENT_TYPE - type of the contained elements
+ * - SV_DECLARE - if defined the functions and types are declared
+ * - SV_DEFINE - if defined the functions and types are defined
+ * - SV_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ * IDENTIFICATION
+ * src/include/lib/simplevector.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+/* helpers */
+#define SV_MAKE_PREFIX(a) CppConcat(a,_)
+#define SV_MAKE_NAME(name) SV_MAKE_NAME_(SV_MAKE_PREFIX(SV_PREFIX),name)
+#define SV_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* type declarations */
+#define SV_TYPE SV_PREFIX
+
+/* function declarations */
+#define SV_INIT SV_MAKE_NAME(init)
+#define SV_DESTROY SV_MAKE_NAME(destroy)
+#define SV_RESET SV_MAKE_NAME(reset)
+#define SV_CLEAR SV_MAKE_NAME(clear)
+#define SV_DATA SV_MAKE_NAME(data)
+#define SV_EMPTY SV_MAKE_NAME(empty)
+#define SV_SIZE SV_MAKE_NAME(size)
+#define SV_RESIZE SV_MAKE_NAME(resize)
+#define SV_CAPACITY SV_MAKE_NAME(capacity)
+#define SV_RESERVE SV_MAKE_NAME(reserve)
+#define SV_APPEND SV_MAKE_NAME(append)
+#define SV_APPEND_N SV_MAKE_NAME(append_n)
+#define SV_INSERT SV_MAKE_NAME(insert)
+#define SV_INSERT_N SV_MAKE_NAME(insert_n)
+#define SV_ERASE SV_MAKE_NAME(erase)
+#define SV_ERASE_N SV_MAKE_NAME(erase_n)
+#define SV_BEGIN SV_MAKE_NAME(begin)
+#define SV_END SV_MAKE_NAME(end)
+#define SV_BACK SV_MAKE_NAME(back)
+#define SV_POP_BACK SV_MAKE_NAME(pop_back)
+#define SV_SWAP SV_MAKE_NAME(swap)
+
+#ifndef SV_IN_PLACE_CAPACITY
+#define SV_IN_PLACE_CAPACITY 3
+#endif
+
+#ifdef SV_DECLARE
+
+typedef struct SV_TYPE
+{
+ /*
+ * If size is <= SV_IN_PLACE_CAPACITY, then it represents the number of
+ * elements stored in u.elements. Otherwise, it is the capacity of the
+ * buffer in u.overflow.data (in number of potential elements), and
+ * u.overflow.count represents the number of occupied elements.
+ */
+ uint32 size;
+ union
+ {
+ struct
+ {
+ void *data;
+ uint32 count;
+ } overflow;
+ SV_ELEMENT_TYPE elements[SV_IN_PLACE_CAPACITY];
+ } u;
+} SV_TYPE;
+
+/* externally visible function prototypes */
+SV_SCOPE void SV_INIT(SV_TYPE *vec);
+SV_SCOPE void SV_DESTROY(SV_TYPE *vec);
+SV_SCOPE void SV_RESET(SV_TYPE *vec);
+SV_SCOPE void SV_CLEAR(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_DATA(SV_TYPE *vec);
+SV_SCOPE bool SV_EMPTY(SV_TYPE *vec);
+SV_SCOPE uint32 SV_SIZE(SV_TYPE *vec);
+SV_SCOPE void SV_RESIZE(SV_TYPE *vec, uint32 size);
+SV_SCOPE uint32 SV_CAPACITY(SV_TYPE *vec);
+SV_SCOPE void SV_RESERVE(SV_TYPE *vec, uint32 capacity);
+SV_SCOPE void SV_APPEND(SV_TYPE *vec, const SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_APPEND_N(SV_TYPE *vec, const SV_ELEMENT_TYPE *values,
+ uint32 size);
+SV_SCOPE void SV_INSERT(SV_TYPE *vec,
+ SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *value);
+SV_SCOPE void SV_INSERT_N(SV_TYPE *vec,
+ SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *values,
+ uint32 n);
+SV_SCOPE void SV_ERASE(SV_TYPE *vec, SV_ELEMENT_TYPE *position);
+SV_SCOPE void SV_ERASE_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position, uint32 n);
+SV_SCOPE void SV_SWAP(SV_TYPE *a, SV_TYPE *b);
+SV_SCOPE SV_ELEMENT_TYPE *SV_BEGIN(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_END(SV_TYPE *vec);
+SV_SCOPE SV_ELEMENT_TYPE *SV_BACK(SV_TYPE *vec);
+SV_SCOPE void SV_POP_BACK(SV_TYPE *vec);
+
+#endif
+
+#ifdef SV_DEFINE
+
+/*
+ * Initialize a vector in-place.
+ */
+SV_SCOPE void
+SV_INIT(SV_TYPE *vec)
+{
+ vec->size = 0;
+}
+
+/*
+ * Free any resources owned by the vector.
+ */
+SV_SCOPE void
+SV_DESTROY(SV_TYPE *vec)
+{
+ SV_RESET(vec);
+}
+
+/*
+ * Free any resources owned by the vector.
+ */
+SV_SCOPE void
+SV_RESET(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ pfree(vec->u.overflow.data);
+ vec->size = 0;
+}
+
+/*
+ * Clear the vector so that it contains no elements.
+ */
+SV_SCOPE void
+SV_CLEAR(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count = 0;
+ else
+ vec->size = 0;
+}
+
+/*
+ * Return a pointer to the elements in the vector.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_DATA(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->u.overflow.data;
+ else
+ return &vec->u.elements[0];
+}
+
+/*
+ * Check if the vector is empty (has no elements).
+ */
+SV_SCOPE bool
+SV_EMPTY(SV_TYPE *vec)
+{
+ return SV_SIZE(vec) == 0;
+}
+
+/*
+ * Return the number of elements in the vector.
+ */
+SV_SCOPE uint32
+SV_SIZE(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->u.overflow.count;
+ else
+ return vec->size;
+}
+
+/*
+ * Resize the vector, discarding elements at the end, or creating new
+ * zero-initialized elements as required.
+ */
+SV_SCOPE void
+SV_RESIZE(SV_TYPE *vec, uint32 size)
+{
+ uint32 old_size = SV_SIZE(vec);
+
+ /* Growing? */
+ if (size > old_size)
+ {
+ SV_RESERVE(vec, size);
+ memset(&SV_DATA(vec)[old_size], 0,
+ sizeof(SV_ELEMENT_TYPE) * (size - old_size));
+ }
+
+ /* Set the new size. */
+ if (vec->size <= SV_IN_PLACE_CAPACITY)
+ vec->size = size;
+ else
+ vec->u.overflow.count = size;
+}
+
+/*
+ * Return the number of elements that can be held in the vector before it
+ * needs to reallocate.
+ */
+SV_SCOPE uint32
+SV_CAPACITY(SV_TYPE *vec)
+{
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ return vec->size;
+ else
+ return SV_IN_PLACE_CAPACITY;
+}
+
+/*
+ * Make sure we have capacity for a given number of elements without having to
+ * reallocate.
+ */
+SV_SCOPE void
+SV_RESERVE(SV_TYPE *vec, uint32 capacity)
+{
+ void *new_buffer;
+
+ /* Do nothing if we already have that much capacity. */
+ if (capacity <= SV_IN_PLACE_CAPACITY || capacity < vec->size)
+ return;
+
+ /* Allocate larger buffer. */
+#ifdef SV_GLOBAL_MEMORY_CONTEXT
+ new_buffer = MemoryContextAlloc(SV_GLOBAL_MEMORY_CONTEXT,
+ sizeof(SV_ELEMENT_TYPE) * capacity);
+#else
+ new_buffer = palloc(sizeof(SV_ELEMENT_TYPE) * capacity);
+#endif
+
+ /* Copy existing data to new buffer. */
+ if (vec->size <= SV_IN_PLACE_CAPACITY)
+ {
+ /* Promote from in-line format. */
+ if (vec->size > 0)
+ memcpy(new_buffer,
+ vec->u.elements,
+ sizeof(SV_ELEMENT_TYPE) * vec->size);
+ vec->u.overflow.count = vec->size;
+ }
+ else
+ {
+ /* Copy from existing smaller overflow buffer, and free it. */
+ if (vec->u.overflow.count > 0)
+ memcpy(new_buffer,
+ vec->u.overflow.data,
+ sizeof(SV_ELEMENT_TYPE) * vec->u.overflow.count);
+ Assert(vec->u.overflow.data);
+ pfree(vec->u.overflow.data);
+ }
+ vec->u.overflow.data = new_buffer;
+ vec->size = capacity;
+}
+
+/*
+ * Append a value to the end of a vector.
+ */
+SV_SCOPE void
+SV_APPEND(SV_TYPE *vec, const SV_ELEMENT_TYPE *value)
+{
+ SV_APPEND_N(vec, value, 1);
+}
+
+/*
+ * Append N values to the end of a vector.
+ */
+SV_SCOPE void
+SV_APPEND_N(SV_TYPE *vec, const SV_ELEMENT_TYPE *values, uint32 n)
+{
+ uint32 size = SV_SIZE(vec);
+
+ SV_RESERVE(vec, size + n);
+ memcpy(&SV_DATA(vec)[size], values, sizeof(SV_ELEMENT_TYPE) * n);
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count += n;
+ else
+ vec->size += n;
+}
+
+/*
+ * Insert a value before an arbitrary position in the vector. This is not
+ * especially efficient as it must shift values to make space.
+ */
+SV_SCOPE void
+SV_INSERT(SV_TYPE *vec, SV_ELEMENT_TYPE *position, const SV_ELEMENT_TYPE *value)
+{
+ SV_INSERT_N(vec, position, value, 1);
+}
+
+/*
+ * Insert N values before an arbitrary position in the vector. This is not
+ * especially efficient as it must shift values to make space.
+ */
+SV_SCOPE void
+SV_INSERT_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position,
+ const SV_ELEMENT_TYPE *values, uint32 n)
+{
+ uint32 size = SV_SIZE(vec);
+ uint32 i = position - SV_DATA(vec);
+ SV_ELEMENT_TYPE *data;
+
+ if (n == 0)
+ return;
+
+ Assert(position >= SV_DATA(vec) &&
+ position <= SV_DATA(vec) + size);
+ SV_RESERVE(vec, size + n);
+ data = SV_DATA(vec);
+ memmove(&data[i + n],
+ &data[i],
+ sizeof(SV_ELEMENT_TYPE) * (size - i));
+ memcpy(&data[i], values, sizeof(SV_ELEMENT_TYPE) * n);
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count += n;
+ else
+ vec->size += n;
+}
+
+/*
+ * Erase an arbitrary element in the vector. This is not especially
+ * efficient as it must shift trailing values.
+ */
+SV_SCOPE void
+SV_ERASE(SV_TYPE *vec, SV_ELEMENT_TYPE *position)
+{
+ SV_ERASE_N(vec, position, 1);
+}
+
+/*
+ * Erase N values beginning with an arbitrary element in the vector. This is
+ * not especially efficient as it must shift trailing values.
+ */
+SV_SCOPE void
+SV_ERASE_N(SV_TYPE *vec, SV_ELEMENT_TYPE *position, uint32 n)
+{
+ Assert(position >= SV_DATA(vec) &&
+ position + n <= SV_DATA(vec) + SV_SIZE(vec));
+	memmove(position,
+			position + n,
+			sizeof(SV_ELEMENT_TYPE) *
+			(SV_SIZE(vec) - n - (position - SV_DATA(vec))));
+ if (vec->size > SV_IN_PLACE_CAPACITY)
+ vec->u.overflow.count -= n;
+ else
+ vec->size -= n;
+}
+
+/*
+ * Get a pointer to the first element, if there is one.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_BEGIN(SV_TYPE *vec)
+{
+ return SV_DATA(vec);
+}
+
+/*
+ * Get a pointer to the element past the last element.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_END(SV_TYPE *vec)
+{
+ return SV_DATA(vec) + SV_SIZE(vec);
+}
+
+/*
+ * Get a pointer to the back (last) element.
+ */
+SV_SCOPE SV_ELEMENT_TYPE *
+SV_BACK(SV_TYPE *vec)
+{
+ Assert(!SV_EMPTY(vec));
+ return SV_DATA(vec) + SV_SIZE(vec) - 1;
+}
+
+/*
+ * Remove the back (last) element.
+ */
+SV_SCOPE void
+SV_POP_BACK(SV_TYPE *vec)
+{
+ Assert(!SV_EMPTY(vec));
+ SV_RESIZE(vec, SV_SIZE(vec) - 1);
+}
+
+/*
+ * Swap the contents of two vectors.
+ */
+SV_SCOPE void
+SV_SWAP(SV_TYPE *a, SV_TYPE *b)
+{
+ SV_TYPE tmp;
+
+ tmp = *a;
+ *a = *b;
+ *b = tmp;
+}
+
+#endif
+
+#undef SV_APPEND
+#undef SV_APPEND_N
+#undef SV_BACK
+#undef SV_BEGIN
+#undef SV_CAPACITY
+#undef SV_CLEAR
+#undef SV_DATA
+#undef SV_DECLARE
+#undef SV_DEFINE
+#undef SV_DESTROY
+#undef SV_EMPTY
+#undef SV_END
+#undef SV_ERASE
+#undef SV_ERASE_N
+#undef SV_INIT
+#undef SV_INSERT
+#undef SV_INSERT_N
+#undef SV_IN_PLACE_CAPACITY
+#undef SV_MAKE_NAME
+#undef SV_MAKE_NAME_
+#undef SV_MAKE_PREFIX
+#undef SV_POP_BACK
+#undef SV_RESERVE
+#undef SV_RESET
+#undef SV_RESIZE
+#undef SV_SIZE
+#undef SV_SWAP
+#undef SV_TYPE
diff --git a/src/include/lib/sort_utils.h b/src/include/lib/sort_utils.h
new file mode 100644
index 00000000000..dad0629ea75
--- /dev/null
+++ b/src/include/lib/sort_utils.h
@@ -0,0 +1,308 @@
+/*-------------------------------------------------------------------------
+ *
+ * sort_utils.h
+ *
+ * Simple sorting-related algorithms specialized for arrays of a
+ * parameterized type, using inlined comparators.
+ *
+ * Copyright (c) 2019, PostgreSQL Global Development Group
+ *
+ * Usage notes:
+ *
+ * To generate functions specialized for a type, the following parameter
+ * macros should be #define'd before this file is included.
+ *
+ * - SA_PREFIX - prefix for all symbol names generated.
+ * - SA_ELEMENT_TYPE - type of the referenced elements
+ * - SA_DECLARE - if defined the functions and types are declared
+ * - SA_DEFINE - if defined the functions and types are defined
+ * - SA_SCOPE - scope (e.g. extern, static inline) for functions
+ *
+ * The following are relevant only when SA_DEFINE is defined:
+ *
+ * - SA_COMPARE(a, b) - an expression to compare pointers to two values
+ *
+ * IDENTIFICATION
+ * src/include/lib/sort_utils.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#define SA_MAKE_PREFIX(a) CppConcat(a,_)
+#define SA_MAKE_NAME(name) SA_MAKE_NAME_(SA_MAKE_PREFIX(SA_PREFIX),name)
+#define SA_MAKE_NAME_(a,b) CppConcat(a,b)
+
+/* function declarations */
+#define SA_SORT SA_MAKE_NAME(sort)
+#define SA_UNIQUE SA_MAKE_NAME(unique)
+#define SA_BINARY_SEARCH SA_MAKE_NAME(binary_search)
+#define SA_LOWER_BOUND SA_MAKE_NAME(lower_bound)
+
+#ifdef SA_DECLARE
+
+SA_SCOPE void SA_SORT(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last);
+SA_SCOPE SA_ELEMENT_TYPE *SA_UNIQUE(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last);
+SA_SCOPE SA_ELEMENT_TYPE *SA_BINARY_SEARCH(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value);
+SA_SCOPE SA_ELEMENT_TYPE *SA_LOWER_BOUND(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value);
+
+#endif
+
+#ifdef SA_DEFINE
+
+/* helper functions */
+#define SA_MED3 SA_MAKE_NAME(med3)
+#define SA_SWAP SA_MAKE_NAME(swap)
+#define SA_SWAPN SA_MAKE_NAME(swapn)
+
+static inline SA_ELEMENT_TYPE *
+SA_MED3(SA_ELEMENT_TYPE *a,
+ SA_ELEMENT_TYPE *b,
+ SA_ELEMENT_TYPE *c)
+{
+ return SA_COMPARE(a, b) < 0 ?
+ (SA_COMPARE(b, c) < 0 ? b : (SA_COMPARE(a, c) < 0 ? c : a))
+ : (SA_COMPARE(b, c) > 0 ? b : (SA_COMPARE(a, c) < 0 ? a : c));
+}
+
+static inline void
+SA_SWAP(SA_ELEMENT_TYPE *a, SA_ELEMENT_TYPE *b)
+{
+ SA_ELEMENT_TYPE tmp = *a;
+
+ *a = *b;
+ *b = tmp;
+}
+
+static inline void
+SA_SWAPN(SA_ELEMENT_TYPE *a, SA_ELEMENT_TYPE *b, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; ++i)
+ SA_SWAP(&a[i], &b[i]);
+}
+
+/*
+ * Sort an array [first, last). This is the same algorithm as
+ * src/port/qsort.c, parameterized at compile-time for comparison and element
+ * type.
+ */
+SA_SCOPE void
+SA_SORT(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last)
+{
+ SA_ELEMENT_TYPE *a = first,
+ *pa,
+ *pb,
+ *pc,
+ *pd,
+ *pl,
+ *pm,
+ *pn;
+ size_t d1,
+ d2;
+ int r,
+ presorted;
+ size_t n = last - first;
+
+loop:
+ if (n < 7)
+ {
+ for (pm = a + 1; pm < a + n; ++pm)
+ for (pl = pm; pl > a && SA_COMPARE(pl - 1, pl) > 0; --pl)
+ SA_SWAP(pl, pl - 1);
+ return;
+ }
+ presorted = 1;
+ for (pm = a + 1; pm < a + n; ++pm)
+ {
+ if (SA_COMPARE(pm - 1, pm) > 0)
+ {
+ presorted = 0;
+ break;
+ }
+ }
+ if (presorted)
+ return;
+ pm = a + (n / 2);
+ if (n > 7)
+ {
+ pl = a;
+ pn = a + (n - 1);
+ if (n > 40)
+ {
+ size_t d = n / 8;
+
+ pl = SA_MED3(pl, pl + d, pl + 2 * d);
+ pm = SA_MED3(pm - d, pm, pm + d);
+ pn = SA_MED3(pn - 2 * d, pn - d, pn);
+ }
+ pm = SA_MED3(pl, pm, pn);
+ }
+ SA_SWAP(a, pm);
+ pa = pb = a + 1;
+ pc = pd = a + (n - 1);
+ for (;;)
+ {
+ while (pb <= pc && (r = SA_COMPARE(pb, a)) <= 0)
+ {
+ if (r == 0)
+ {
+ SA_SWAP(pa, pb);
+ ++pa;
+ }
+ ++pb;
+ }
+ while (pb <= pc && (r = SA_COMPARE(pc, a)) >= 0)
+ {
+ if (r == 0)
+ {
+ SA_SWAP(pc, pd);
+ --pd;
+ }
+ --pc;
+ }
+ if (pb > pc)
+ break;
+ SA_SWAP(pb, pc);
+ ++pb;
+ --pc;
+ }
+ pn = a + n;
+ d1 = Min(pa - a, pb - pa);
+ SA_SWAPN(a, pb - d1, d1);
+ d1 = Min(pd - pc, pn - pd - 1);
+ SA_SWAPN(pb, pn - d1, d1);
+ d1 = pb - pa;
+ d2 = pd - pc;
+ if (d1 <= d2)
+ {
+ /* Recurse on left partition, then iterate on right partition */
+ if (d1 > 1)
+ SA_SORT(a, a + d1);
+ if (d2 > 1)
+ {
+ /* Iterate rather than recurse to save stack space */
+ /* SA_SORT(pn - d2, pn + d2) */
+ a = pn - d2;
+ n = d2;
+ goto loop;
+ }
+ }
+ else
+ {
+ /* Recurse on right partition, then iterate on left partition */
+ if (d2 > 1)
+ SA_SORT(pn - d2, pn);
+ if (d1 > 1)
+ {
+ /* Iterate rather than recurse to save stack space */
+ /* SA_SORT(a, a + d1) */
+ n = d1;
+ goto loop;
+ }
+ }
+}
+
+/*
+ * Remove duplicates from an array [first, last). Return the new last pointer
+ * (ie one past the new end).
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_UNIQUE(SA_ELEMENT_TYPE *first, SA_ELEMENT_TYPE *last)
+{
+ SA_ELEMENT_TYPE *write_head;
+ SA_ELEMENT_TYPE *read_head;
+
+ if (last - first <= 1)
+ return last;
+
+ write_head = first;
+ read_head = first + 1;
+
+ while (read_head < last)
+ {
+ if (SA_COMPARE(read_head, write_head) != 0)
+ *++write_head = *read_head;
+ ++read_head;
+ }
+ return write_head + 1;
+}
+
+/*
+ * Find an element in the sorted array [first, last) that compares equal to
+ * a given value. Return NULL if there is none.
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_BINARY_SEARCH(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value)
+{
+ SA_ELEMENT_TYPE *lower = first;
+ SA_ELEMENT_TYPE *upper = last - 1;
+
+ while (lower <= upper)
+ {
+ SA_ELEMENT_TYPE *mid;
+ int cmp;
+
+ mid = lower + (upper - lower) / 2;
+ cmp = SA_COMPARE(mid, value);
+ if (cmp < 0)
+ lower = mid + 1;
+ else if (cmp > 0)
+ upper = mid - 1;
+ else
+ return mid;
+ }
+
+ return NULL;
+}
+
+/*
+ * Find the first element in the range [first, last) that is not less than
+ * value, in a sorted array.
+ */
+SA_SCOPE SA_ELEMENT_TYPE *
+SA_LOWER_BOUND(SA_ELEMENT_TYPE *first,
+ SA_ELEMENT_TYPE *last,
+ SA_ELEMENT_TYPE *value)
+{
+ ptrdiff_t count;
+
+ count = last - first;
+ while (count > 0)
+ {
+ SA_ELEMENT_TYPE *iter = first;
+ ptrdiff_t step = count / 2;
+
+ iter += step;
+ if (SA_COMPARE(iter, value) < 0)
+ {
+ first = ++iter;
+ count -= step + 1;
+ }
+ else
+ count = step;
+ }
+ return first;
+}
+
+#endif
+
+#undef SA_BINARY_SEARCH
+#undef SA_DECLARE
+#undef SA_DEFINE
+#undef SA_LOWER_BOUND
+#undef SA_MAKE_NAME
+#undef SA_MAKE_NAME_
+#undef SA_MAKE_PREFIX
+#undef SA_MED3
+#undef SA_SORT
+#undef SA_SWAP
+#undef SA_SWAPN
+#undef SA_UNIQUE
--
2.19.1
0002-Refactor-the-fsync-machinery-to-support-future-SM-v6.patch (application/octet-stream)
From 54112d952592b9a1d3858384b5c79b6334b20ac6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 31 Dec 2018 15:25:16 +1300
Subject: [PATCH 2/2] Refactor the fsync machinery to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting fsyncs out of md.c into a new
translation unit smgrsync.c.
2. Have smgrsync.c perform the actual fsync() calls via the existing
polymorphic smgrimmedsync() interface, extended to allow an individual
segment number to be specified.
3. Teach the checkpointer how to forget individual segments that are
unlinked from the 'front' after having been dropped from shared
buffers.
4. Move the request tracking from a bitmapset into a sorted vector,
because the proposed block storage managers are not anchored at zero
and use potentially very large and sparse integers.
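To illustrate the first two points, a future storage manager's write
path might register a dirty segment roughly like this (hypothetical
code, not part of this patch):

    static void
    undo_register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
                                SegmentNumber segno)
    {
        /* Ask the checkpointer to fsync this segment at checkpoint time. */
        if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, segno))
        {
            /*
             * The request queue was full, so fall back to syncing the
             * segment ourselves, as md.c's register_dirty_segment() does.
             */
            smgrimmedsync(reln, forknum, segno);
        }
    }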
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/xlog.c | 2 +
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/catalog/heap.c | 2 +-
src/backend/commands/dbcommands.c | 3 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 22 +-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/ipc/ipci.c | 1 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 801 ++----------------------
src/backend/storage/smgr/smgr.c | 104 ++--
src/backend/storage/smgr/smgrsync.c | 855 ++++++++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 39 ++
src/include/storage/smgr.h | 29 +-
src/include/storage/smgrsync.h | 36 ++
25 files changed, 1055 insertions(+), 888 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index e43fbe0005f..6fa07db4f8d 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -188,7 +188,7 @@ blbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index 2c4a1453576..cb4ba0d569c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -9361,7 +9361,7 @@ heap_sync(Relation rel)
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
/* FSM is not critical, don't bother syncing it */
@@ -9372,7 +9372,7 @@ heap_sync(Relation rel)
toastrel = heap_open(rel->rd_rel->reltoastrelid, AccessShareLock);
FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
heap_close(toastrel, AccessShareLock);
}
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2efd..b29112c133f 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -178,7 +178,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index d9b9229ab76..30524ded768 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1207,7 +1207,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (RelationNeedsWAL(wstate->index))
{
RelationOpenSmgr(wstate->index);
- smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+		smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
}
}
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index f428a151385..0eb5ced43d6 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -205,7 +205,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9823b757676..97195bfb529 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 5274e26783e..b2a58d5c628 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -31,6 +31,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 694000798a7..8a004afe216 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1405,7 +1405,7 @@ heap_create_init_fork(Relation rel)
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
- smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 1208fdf33ff..8dd311b7374 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
@@ -55,6 +55,7 @@
#include "storage/ipc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index e1af2c44953..6513917c821 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11300,7 +11300,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
* here, they might still not be on disk when the crash occurs.
*/
if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ smgrimmedsync(dst, forkNum, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 946e1b99767..07f7eeaccf1 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de5..fd5803f1959 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359b..a43dc03be33 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -47,6 +47,8 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -56,6 +58,7 @@
#include "storage/proc.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -108,10 +111,10 @@
*/
typedef struct
{
- RelFileNode rnode;
+ int type;
+ RelFileNode rnode;
ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ SegmentNumber segno;
} CheckpointerRequest;
typedef struct
@@ -1077,9 +1080,7 @@ RequestCheckpoint(int flags)
* RelFileNodeBackend.
*
* segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
+ * fsync'd.
*
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
@@ -1092,13 +1093,14 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
CheckpointerRequest *request;
bool too_full;
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+		elog(ERROR, "ForwardFsyncRequest must not be called in single-user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
@@ -1130,6 +1132,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ request->type = type;
request->rnode = rnode;
request->forknum = forknum;
request->segno = segno;
@@ -1314,7 +1317,8 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberFsyncRequest(request->type, request->rnode, request->forknum,
+ request->segno);
END_CRIT_SECTION();
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385fe..97bdfcb7b33 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2849e47d99b..4e70cd9efa6 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..c9c4be325ed 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e4501ff9bc9..d145755c040 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -134,30 +121,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +150,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +173,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -388,7 +295,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -448,7 +355,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -993,423 +900,55 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
*
* Note that only writes already issued are synced; this routine knows
* nothing of dirty buffers that may exist inside the buffer manager.
+ *
+ * See smgrimmedsync comment for contract.
*/
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+bool
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int segno;
+ MdfdVec *segments;
+ size_t num_segments;
+ size_t i;
- /*
- * NOTE: mdnblocks makes sure we have opened all active segments, so that
- * fsync loop will get them all!
- */
- mdnblocks(reln, forknum);
-
- segno = reln->md_num_open_segs[forknum];
+ if (segno != InvalidSegmentNumber)
+ {
+ /*
+ * Get the specified segment, or report failure if it doesn't seem to
+ * exist.
+ */
+ segments = _mdfd_openseg(reln, forknum, segno * RELSEG_SIZE,
+ EXTENSION_RETURN_NULL);
+ if (segments == NULL)
+ return false;
+ num_segments = 1;
+ }
+ else
+ {
+ /*
+ * NOTE: mdnblocks makes sure we have opened all active segments, so that
+ * fsync loop will get them all!
+ */
+ mdnblocks(reln, forknum);
+ num_segments = reln->md_num_open_segs[forknum];
+ segments = &reln->md_seg_fds[forknum][0];
+ }
- while (segno > 0)
+ for (i = 0; i < num_segments; ++i)
{
- MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ MdfdVec *v = &segments[i];
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
FilePathName(v->mdfd_vfd))));
- segno--;
- }
-}
-
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
}
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return true;
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
@@ -1417,16 +956,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1438,258 +969,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab33..c6d3da1c1a5 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -58,10 +59,8 @@ typedef struct f_smgr
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
- void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ bool (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
} f_smgr;
@@ -81,10 +80,7 @@ static const f_smgr smgrsw[] = {
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
- .smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_immedsync = mdimmedsync
}
};
@@ -104,6 +100,14 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation.
+ */
+static inline int
+which_for_relfilenode(RelFileNode rnode)
+{
+ return 0; /* we only have md.c at present */
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +122,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -185,7 +191,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
reln->smgr_vm_nblocks = InvalidBlockNumber;
- reln->smgr_which = 0; /* we only have md.c at present */
+ reln->smgr_which = which_for_relfilenode(rnode);
/* mark it not open */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
@@ -726,17 +732,20 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* smgrimmedsync() -- Force the specified relation to stable storage.
*
* Synchronously force all previous writes to the specified relation
- * down to disk.
- *
- * This is useful for building completely new relations (eg, new
- * indexes). Instead of incrementally WAL-logging the index build
- * steps, we can just write completed index pages to disk with smgrwrite
- * or smgrextend, and then fsync the completed index file before
- * committing the transaction. (This is sufficient for purposes of
- * crash recovery, since it effectively duplicates forcing a checkpoint
- * for the completed index. But it is *not* sufficient if one wishes
- * to use the WAL log for PITR or replication purposes: in that case
- * we have to make WAL entries as well.)
+ * down to disk. If segno is not InvalidSegmentNumber, this applies
+ * only to data in that one segment file.
+ *
+ * Used for checkpointing dirty files.
+ *
+ * This can also be used for building completely new relations (eg, new
+ * indexes). Instead of incrementally WAL-logging the index build steps,
+ * we can just write completed index pages to disk with smgrwrite or
+ * smgrextend, and then fsync the completed index file before committing
+ * the transaction. (This is sufficient for purposes of crash recovery,
+ * since it effectively duplicates forcing a checkpoint for the completed
+ * index. But it is *not* sufficient if one wishes to use the WAL log
+ * for PITR or replication purposes: in that case we have to make WAL
+ * entries as well.)
*
* The preceding writes should specify skipFsync = true to avoid
* duplicative fsyncs.
@@ -744,57 +753,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* Note that you need to do FlushRelationBuffers() first if there is
* any possibility that there are dirty buffers for the relation;
* otherwise the sync is not very meaningful.
+ *
+ * Failure to fsync raises an error, but non-existence of a requested
+ * segment is reported with a false return value.
*/
-void
-smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
-{
- smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
-}
-
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
+bool
+smgrimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ return smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum, segno);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 00000000000..d343e59931e
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,855 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and startup.
+ *
+ * Originally this logic was inside md.c, but it is now made more general,
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+#define SV_PREFIX segnum_vector
+#define SV_DECLARE
+#define SV_DEFINE
+#define SV_ELEMENT_TYPE BlockNumber
+#define SV_SCOPE static inline
+#define SV_GLOBAL_MEMORY_CONTEXT pendingOpsCxt
+#include "lib/simplevector.h"
+
+#define SA_PREFIX segnum_array
+#define SA_COMPARE(a,b) (*a < *b ? -1 : *a == *b ? 0 : 1)
+#define SA_DECLARE
+#define SA_DEFINE
+#define SA_ELEMENT_TYPE SV_ELEMENT_TYPE
+#define SA_SCOPE static inline
+#include "lib/sort_utils.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. A hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+
+typedef uint32 CycleCtr; /* can be any convenient integer size */
+
+/*
+ * Values for the "type" member of CheckpointerRequest.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FSYNC_SEGMENT_REQUEST 1
+#define FORGET_SEGMENT_FSYNC 2
+#define FORGET_RELATION_FSYNC 3
+#define FORGET_DATABASE_FSYNC 4
+#define UNLINK_RELATION_REQUEST 5
+#define UNLINK_SEGMENT_REQUEST 6
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ RelFileNode rnode;
+ segnum_vector requests[MAX_FORKNUM + 1];
+ segnum_vector requests_in_progress[MAX_FORKNUM + 1];
+ CycleCtr cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static bool sync_in_progress = false;
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbFsyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ ForkNumber forknum;
+
+ entry->cycle_ctr = sync_cycle_ctr;
+
+ /*
+ * If any requests remain unprocessed, they need to be merged with
+ * the segment numbers that have arrived since.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests = &entry->requests[forknum];
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ if (!segnum_vector_empty(requests_in_progress))
+ {
+ /* Append the unfinished requests that were not yet handled. */
+ segnum_vector_append_n(requests,
+ segnum_vector_data(requests_in_progress),
+ segnum_vector_size(requests_in_progress));
+ segnum_vector_reset(requests_in_progress);
+
+ /* Sort and make unique. */
+ segnum_array_sort(segnum_vector_begin(requests),
+ segnum_vector_end(requests));
+ segnum_vector_resize(requests,
+ segnum_array_unique(segnum_vector_begin(requests),
+ segnum_vector_end(requests)) -
+ segnum_vector_begin(requests));
+ }
+ }
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ ForkNumber forknum;
+ SMgrRelation reln;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync requests, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * Scan over the forks and segments represented by the entry.
+ *
+ * The vector manipulations are slightly tricky, because we can call
+ * AbsorbFsyncRequests() inside the loop and that could result in new
+ * segment numbers being added. So we swap the contents of "requests"
+ * with "requests_in_progress", and if we fail we'll merge it with any
+ * new requests that have arrived in the meantime.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ /*
+ * Transfer the current set of segment numbers into the "in
+ * progress" vector (which must be empty initially).
+ */
+ Assert(segnum_vector_empty(requests_in_progress));
+ segnum_vector_swap(&entry->requests[forknum], requests_in_progress);
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * files at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ segnum_vector_clear(requests_in_progress);
+
+ /* Loop until all requests have been handled. */
+ while (!segnum_vector_empty(requests_in_progress))
+ {
+ SegmentNumber segno = *segnum_vector_back(requests_in_progress);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ reln = smgropen(entry->rnode, InvalidBackendId);
+ if (!smgrimmedsync(reln, forknum, segno))
+ {
+ /*
+ * The underlying file couldn't be found. Check if a
+ * later message in the queue reports that it has been
+ * unlinked; if so it will be removed from the vector,
+ * indicating that we can safely skip it.
+ */
+ AbsorbFsyncRequests();
+ if (!segnum_array_binary_search(segnum_vector_begin(requests_in_progress),
+ segnum_vector_end(requests_in_progress),
+ &segno))
+ continue;
+
+ /* Otherwise it's an unexpectedly missing file. */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open backing file to fsync: %u/%u/%u",
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno)));
+ }
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ /* Remove this segment number. */
+ Assert(segno == *segnum_vector_back(requests_in_progress));
+ segnum_vector_pop_back(requests_in_progress);
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d db=%u rel=%u seg=%u time=%.3f msec",
+ processed,
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno,
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+ }
+ }
+
+ /*
+ * We've finished everything that was requested before we started to
+ * scan the entry. If no new requests have been inserted meanwhile,
+ * remove the entry. Otherwise, update its cycle counter, as all the
+ * requests now in it must have arrived during this cycle.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ Assert(segnum_vector_empty(&entry->requests_in_progress[forknum]));
+ if (!segnum_vector_empty(&entry->requests[forknum]))
+ break;
+ segnum_vector_reset(&entry->requests[forknum]);
+ }
+ if (forknum <= MAX_FORKNUM)
+ entry->cycle_ctr = sync_cycle_ctr;
+ else
+ {
+ /* Okay to remove it */
+ if (hash_search(pendingFsyncTable, &entry->rnode,
+ HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOpsTable corrupted");
+ }
+ } /* end loop over hashtable entries */
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of smgrsync */
+ sync_in_progress = false;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * Mark a file as needing fsync.
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request to
+ * the checkpointer process.
+ *
+ * Returns true on success, but false if the queue was full and we couldn't
+ * pass the request to the checkpointer, meaning that the caller must
+ * perform the fsync.
+ */
+bool
+FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ RememberFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum, segno);
+ return true;
+ }
+ else
+ return ForwardFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum,
+ segno);
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+
+ ForwardFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+/*
+ * Find and remove a segment number by binary search.
+ */
+static inline void
+delete_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ SegmentNumber *position =
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno);
+
+ if (position != segnum_vector_end(vec) &&
+ *position == segno)
+ segnum_vector_erase(vec, position);
+}
+
+/*
+ * Add a segment number by binary search. Hopefully these tend to be added at
+ * the high end, which is cheap.
+ */
+static inline void
+insert_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ segnum_vector_insert(vec,
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno),
+ &segno);
+}
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * Valid values for 'type':
+ * - FSYNC_SEGMENT_REQUEST means to schedule an fsync
+ * - FORGET_SEGMENT_FSYNC means to cancel pending fsyncs for one segment
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers don't exceed INT_MAX.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ Assert(pendingFsyncTable);
+
+ if (type == FORGET_SEGMENT_FSYNC || type == FORGET_RELATION_FSYNC)
+ {
+ PendingFsyncEntry *entry;
+
+ entry = hash_search(pendingFsyncTable, &rnode, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (type == FORGET_SEGMENT_FSYNC)
+ {
+ delete_segno(&entry->requests[forknum], segno);
+ delete_segno(&entry->requests_in_progress[forknum], segno);
+ }
+ else if (forknum == InvalidForkNumber)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ else
+ {
+ /* Forget about all segments for one fork. */
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+ else if (type == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+
+ /* Remove unlink requests */
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (type == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = rnode;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else if (type == FSYNC_SEGMENT_REQUEST)
+ {
+ /* Normal case: enter a request to fsync this segment */
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ &rnode,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ ForkNumber f;
+
+ entry->cycle_ctr = ckpt_cycle_ctr;
+ for (f = 0; f <= MAX_FORKNUM; f++)
+ {
+ segnum_vector_init(&entry->requests[f]);
+ segnum_vector_init(&entry->requests_in_progress[f]);
+ }
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ insert_segno(&entry->requests[forknum], segno);
+ }
+}
+
+/*
+ * ForgetSegmentFsyncRequests -- forget any fsyncs for one segment of a
+ * relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum, segno);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum,
+ segno))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ RelFileNode rnode;
+
+ rnode.dbNode = dbid;
+ rnode.spcNode = 0;
+ rnode.relNode = 0;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ while (!ForwardFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 27ae6be7517..7372377e19c 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f81e0424e72..a12fd0f6ed2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -60,6 +60,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3cb..585ce52667c 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 00000000000..28b13c2d9c0
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern bool ForwardFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, BlockNumber segno);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4ed..5a52229dcd5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,15 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * The type used to identify segment numbers. Generally, segments are an
+ * internal detail of individual storage manager implementations, but since
+ * they appear in various places to allow them to be passed between processes,
+ * it seemed worthwhile to have a typename.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -105,10 +114,9 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern bool smgrimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void AtEOXact_SMgr(void);
@@ -133,16 +141,9 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern bool mdimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
#endif /* SMGR_H */
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 00000000000..01a174e7291
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "storage/smgr.h"
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern bool FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void RememberFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, SegmentNumber segno);
+extern void SetForwardFsyncRequests(void);
+extern void ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetDatabaseFsyncRequests(Oid dbid);
+
+
+#endif
--
2.19.1
With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
Officer, a VP position near the top of the organization, and a
personal friend of Linus), I have been fortunate enough to make
contact directly with Linus Torvalds to discuss this issue. In emails
to me he has told me that this patch is no longer provisional:
Linus has given me permission to quote him, so here is a quote from an
email he sent 2019-01-17:
That commit (b4678df184b3: "errseq: Always report a writeback error
once") was already backported to the stable trees (4.14 and 4.16), so
yes, everything should be fine. We did indeed miss old errors for a
while.

The latest information I could find on this said this commit was "provisional" but
also that it might be back-patched to 4.13 and on. Can you clarify the status of
this patch in either respect?

It was definitely backported to both 4.14 and 4.16, I see it in my
email archives.

The bug may remain in 4.13, but that isn't actually maintained any
more, and I don't think any distro uses it (distros tend to use the
long-term stable kernels that are maintained, or sometimes maintain
their own patch queue).
I think that eliminates the need for this patch.
--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/
On Tue, Jan 22, 2019 at 8:27 AM Kevin Grittner <kgrittn@gmail.com> wrote:
With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
Officer, a VP position near the top of the organization, and a
personal friend of Linus), I have been fortunate enough to make
contact directly with Linus Torvalds to discuss this issue. In emails
to me he has told me that this patch is no longer provisional:

Linus has given me permission to quote him, so here is a quote from an
email he sent 2019-01-17:
That commit (b4678df184b3: "errseq: Always report a writeback error
once") was already backported to the stable trees (4.14 and 4.16), so
yes, everything should be fine. We did indeed miss old errors for a
while.
Sorry, but somehow I got the parent link rather than the intended
commit. Linus got it right, of course.
--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/
Hi,
On 2019-01-22 08:27:48 -0600, Kevin Grittner wrote:
With the help of VMware's Dirk Hohndel (VMware's Chief Open Source
Officer, a VP position near the top of the organization, and a
personal friend of Linus), I have been fortunate enough to make
contact directly with Linus Torvalds to discuss this issue. In emails
to me he has told me that this patch is no longer provisional:
Unfortunately, unless something has changed recently, that patch is
*not* sufficient to really solve the issue - we don't guarantee that
there's always an fd preventing the necessary information from being
evicted from memory:
Note that we might still lose the error if the inode gets evicted from
the cache before anything can reopen it, but that was the case before
errseq_t was merged. At LSF/MM we had some discussion about keeping
inodes with unreported writeback errors around in the cache for longer
(possibly indefinitely), but that's really a separate problem"
And that's entirely possible in postgres. The commit was discussed on
list too, btw...
Greetings,
Andres Freund
On Tue, Jan 22, 2019 at 12:17 PM Andres Freund <andres@anarazel.de> wrote:
Unfortunately, unless something has changed recently, that patch is
*not* sufficient to really solve the issue - we don't guarantee that
there's always an fd preventing the necessary information from being
evicted from memory:
But we can't lose an FD without either closing it or suffering an
abrupt termination that would trigger a PANIC, can we? And close()
always calls fsync(). And I thought our "PANIC on fsync" patch paid
attention to close(). How do you see this happening???
Note that we might still lose the error if the inode gets evicted from
the cache before anything can reopen it, but that was the case before
errseq_t was merged. At LSF/MM we had some discussion about keeping
inodes with unreported writeback errors around in the cache for longer
(possibly indefinitely), but that's really a separate problem"

And that's entirely possible in postgres.
Is it possible for an inode to be evicted while there is an open FD
referencing it?
The commit was discussed on list too, btw...
Can you point to a post explaining how the inode can be evicted?
--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/
On Wed, Jan 23, 2019 at 9:29 AM Kevin Grittner <kgrittn@gmail.com> wrote:
Can you point to a post explaining how the inode can be evicted?
Hi Kevin,
To recap the (admittedly confusing) list of problems with Linux fsync
or our usage:
1. On Linux < 4.13, the error state can be cleared in various
surprising ways so that we never hear about it. Jeff Layton
identified and fixed this problem for 4.13+ by switching from an error
flag to an error counter that is tracked in such a way that every fd
hears about every error in the file.
2. Layton's changes originally assumed that you only wanted to hear
about errors that happened after you opened the file (ie it set the
fd's counter to the inode's current level at open time). Craig Ringer
complained about this. Everyone complained about this. A fix was
then made so that one fd also reports errors that happened before
opening, if no one else has seen them yet. This is the change that
was back-patched as far as Linux 4.14. So long as no third program
comes along and calls fsync on a file that we don't have open
anywhere, thereby eating the "not seen" flag before the checkpointer
gets around to opening the file, all is well.
3. Regardless of the above changes, we also learned that pages are
unceremoniously dropped from the page cache after write-back errors,
so that calling fsync() again after a failure is a bad idea (it might
report success, but your dirty data written before the previous
fsync() call is gone). We handled that by introducing a PANIC after
any fsync failure.
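In outline, the fix escalates fsync() failures on data files to PANIC.
A minimal sketch of the pattern (data_sync_elevel() is the helper the
committed fix uses to choose the error level; the surrounding function
here is illustrative only, not the actual call site):

/*
 * Sketch only: force a file to disk, promoting failure to PANIC via
 * data_sync_elevel() so that we never retry fsync() after an error.
 */
static void
fsync_or_panic(int fd, const char *path)
{
	if (pg_fsync(fd) < 0)
		ereport(data_sync_elevel(ERROR),
				(errcode_for_file_access(),
				 errmsg("could not fsync file \"%s\": %m", path)));
}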
So did MySQL, MongoDB, and probably everyone else who spat out their
cornflakes while reading articles like "PostgreSQL's fsync() surprise"
in the Linux Weekly News that resulted from Craig's report:
https://github.com/mysql/mysql-server/commit/8590c8e12a3374eeccb547359750a9d2a128fa6a
https://github.com/wiredtiger/wiredtiger/commit/ae8bccce3d8a8248afa0e4e0cf67674a43dede96
4. Regardless of all of the above changes, there is still one way to
lose track of an error, as Andres mentioned: during a period of time
when neither the writing backend nor the checkpointer has the file
open, the kernel may choose to evict the inode from kernel memory, and
thereby forget about an error that we haven't received yet.
Problems 1-3 are solved by changes to Linux and PostgreSQL.
Problem 4 would be solved by this "fd-passing" scheme (file
descriptors are never closed until after fsync has been called,
existing in the purgatory of Unix socket ancillary data until the
checkpointer eventually deals with them), but it's complicated and not
quite fully baked yet.
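For concreteness, the sender side of that fd-passing mechanism looks
roughly like this (a sketch using SCM_RIGHTS ancillary data; the
function name, request payload, and socket setup are illustrative, not
taken from the patch):

#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/*
 * Illustrative only: hand "fd" to the checkpointer over a Unix domain
 * socket, alongside the ordinary request message.  At least one byte
 * of real data must accompany the ancillary data on most platforms.
 */
static int
send_fsync_fd(int sock, int fd, const void *request, size_t request_len)
{
	struct msghdr msg;
	struct iovec iov;
	union
	{
		struct cmsghdr align;	/* ensure cmsghdr alignment */
		char	buf[CMSG_SPACE(sizeof(int))];
	}			control;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	iov.iov_base = (void *) request;
	iov.iov_len = request_len;
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = control.buf;
	msg.msg_controllen = sizeof(control.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;	/* transfer a file descriptor */
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) < 0 ? -1 : 0;
}

The receiving side uses recvmsg() with a matching control buffer; the
descriptor it extracts refers to the same open file description, so any
write-back error state travels with it.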
It could also be solved by the kernel agreeing not to evict inodes
that hold error state, or to promote the error to device level, or
something like that. IIUC those kinds of ideas were rejected so far.
(It can also be solved by using FreeBSD and/or ZFS, so you don't have
problem 3 and therefore don't have the other problems.)
I'm not sure how likely that failure mode actually is, but I guess you
need a large number of active files, a low PostgreSQL max_safe_fds so
we close descriptors aggressively, a kernel that is low on memory or
has a high vfs_cache_pressure setting so that it throws out recently
used inodes aggressively, enough time between checkpoints for all of
the above to happen, and then some IO errors when the kernel is
writing back dirty data asynchronously while you don't have the file
open anywhere.
--
Thomas Munro
http://www.enterprisedb.com
Hi,
On 2019-01-22 14:29:23 -0600, Kevin Grittner wrote:
On Tue, Jan 22, 2019 at 12:17 PM Andres Freund <andres@anarazel.de> wrote:
Unfortunately, unless something has changed recently, that patch is
*not* sufficient to really solve the issue - we don't guarantee that
there's always an fd preventing the necessary information from being
evicted from memory:

But we can't lose an FD without either closing it or suffering an
abrupt termination that would trigger a PANIC, can we? And close()
always calls fsync(). And I thought our "PANIC on fsync" patch paid
attention to close(). How do you see this happening???
close() doesn't trigger an fsync() in general (although it does on many
NFS implementations), and doing so would be *terrible* for
performance. Given that it's pretty clear how you can get all FDs
closed, right? You just need sufficient open files that files get closed
due to max_files_per_process, and you can run into the issue. A
thousand open files is pretty easy to reach with forks, indexes,
partitions etc., so this isn't particularly artificial.
Note that we might still lose the error if the inode gets evicted from
the cache before anything can reopen it, but that was the case before
errseq_t was merged. At LSF/MM we had some discussion about keeping
inodes with unreported writeback errors around in the cache for longer
(possibly indefinitely), but that's really a separate problem"

And that's entirely possible in postgres.
Is it possible for an inode to be evicted while there is an open FD
referencing it?
No, but we don't guarantee that there's always an FD open, due to the
max_files_per_process limit discussed above.

The commit was discussed on list too, btw...
Can you point to a post explaining how the inode can be evicted?
/messages/by-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
is, I think, a good overview, with a bunch of links. As is the
referenced LWN article [1] and the commit message you linked.

[1] https://lwn.net/Articles/752063/
Greetings,
Andres Freund
On Tue, Jan 22, 2019 at 2:38 PM Andres Freund <andres@anarazel.de> wrote:
close() doesn't trigger an fsync() in general
What sort of a performance hit was observed when testing the addition
of fsync or fdatasync before any PostgreSQL close() of a writable
file, or have we not yet checked that?
/messages/by-id/20180427222842.in2e4mibx45zdth5@alap3.anarazel.de
is, I think, a good overview, with a bunch of links.
Thanks! Will review.
--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/
Hi,
On 2019-01-22 14:53:11 -0600, Kevin Grittner wrote:
On Tue, Jan 22, 2019 at 2:38 PM Andres Freund <andres@anarazel.de> wrote:
close() doesn't trigger an fsync() in general
What sort of a performance hit was observed when testing the addition
of fsync or fdatasync before any PostgreSQL close() of a writable
file, or have we not yet checked that?
I briefly played with it, and it was so atrocious (as in, less than
something like 0.2x the throughput) that I didn't continue far down that
path. Two ways IIRC (and it's really just from memory) I tried were:
a) Short lived connections that do a bunch of writes to files each. That
turns each disconnect into an fsync of most files.
b) Workload with > max_files_per_process files (IIRC I just used a bunch
of larger tables with a few indexes each) in a read/write workload
that's a bit larger than shared buffers. That led to most file
closes being integrity writes, with obvious downsides.
Greetings,
Andres Freund
I (finally) got a chance to go through these patches and they look
great. Thank you for working on this! Few comments:
- I do not see SmgrFileTag being defined or used like you mentioned in
your first email. I see RelFileNode still being used. Is this planned
for the future?
- Would be great to add a set of tests for SimpleVector.
For the 0001 patch, I'll probably want to reconsider the naming a bit
("simple" -> "specialized", "generic", ...?)
I think the name SimpleVector is fine; it fits with the SimpleHash theme.
If the goal is to shorten it, perhaps PG prefix would suffice?
4. The protocol for forgetting relations etc is slightly different:
if a file is found to be missing, call AbsorbFsyncRequests() and then probe
to see if the segment number disappeared from the set (instead of
cancel flags), though I need to test this case.
Can you explain this part a bit more? I am likely missing something in
the patch.
I couldn't resist the urge to try porting pg_qsort() to this style.
It seems to be about twice as fast as the original at sorting integers
on my machine with -O2. I suppose people aren't going to be too
enthusiastic about yet another copy of qsort in the tree, but maybe
this approach (with a bit more work) could replace the Perl code-gen
for tuple sorting. Then the net number of copies wouldn't go up, but
this could be used for more things too, and it fits with the style of
simplehash.h and simplevector.h. Thoughts?
+1 for avoiding duplicate code. Would it be acceptable to migrate the
rest of the usages to this model over time, perhaps? I'd love to move this
patch forward.
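For concreteness, the 0002 patch instantiates and uses the specialized
sort like this (condensed from the patch; SA_COMPARE is inlined rather
than called through a function pointer, which is presumably where the
speedup over pg_qsort comes from):

    #define SA_PREFIX segnum_array
    #define SA_COMPARE(a,b) (*a < *b ? -1 : *a == *b ? 0 : 1)
    #define SA_DECLARE
    #define SA_DEFINE
    #define SA_ELEMENT_TYPE BlockNumber
    #define SA_SCOPE static inline
    #include "lib/sort_utils.h"

    /* sort a vector of segment numbers, then drop duplicates */
    segnum_array_sort(segnum_vector_begin(&v), segnum_vector_end(&v));
    segnum_vector_resize(&v,
                         segnum_array_unique(segnum_vector_begin(&v),
                                             segnum_vector_end(&v)) -
                         segnum_vector_begin(&v));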
I wonder if it might be better to introduce two different functions
catering to the two different use cases for forcing an immediate sync:
- sync a relation
smgrimmedsyncrel(SMgrRelation, ForkNumber)
- sync a specific segment
smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
This will avoid having to specify InvalidSegmentNumber for the majority of
the callers today.
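A minimal sketch of what I have in mind, assuming the extended
smgrimmedsync() signature from this patch set:

    /* Sketch only: thin wrappers over the extended smgrimmedsync(). */
    static inline bool
    smgrimmedsyncrel(SMgrRelation reln, ForkNumber forknum)
    {
        /* sync all segments of the fork */
        return smgrimmedsync(reln, forknum, InvalidSegmentNumber);
    }

    static inline bool
    smgrimmedsyncseg(SMgrRelation reln, ForkNumber forknum,
                     SegmentNumber segno)
    {
        /* sync only the named segment */
        return smgrimmedsync(reln, forknum, segno);
    }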
--
Shawn Debnath
Amazon Web Services (AWS)
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
I (finally) got a chance to go through these patches and they look
great. Thank you for working on this!
This review is very recent, so I have moved the patch to the next CF.
--
Michael
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
I wonder if it might be better to introduce two different functions
catering to the two different use cases for forcing an immediate sync:
- sync a relation
  smgrimmedsyncrel(SMgrRelation, ForkNumber)
- sync a specific segment
  smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)
This will avoid having to specify InvalidSegmentNumber for the majority of
the callers today.
I have gone ahead and rebased the refactor patch so that it applies
cleanly on top of heapam.c; see patch v7.
I am also attaching a patch (v8) that implements smgrimmedsyncrel() and
smgrimmedsyncseg() as I mentioned in the previous email. It avoids making
callers pass in InvalidSegmentNumber when they just want to sync the
whole relation. As a side effect, I was able to get rid of some extra
checkpointer.h includes.
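For example, a caller like heap_sync() goes from (sketch based on the
v7 hunks):

    smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);

to simply:

    smgrimmedsyncrel(rel->rd_smgr, MAIN_FORKNUM);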
--
Shawn Debnath
Amazon Web Services (AWS)
Attachments:
0002-Refactor-the-fsync-machinery-to-support-future-SM-v7.patch (text/plain; charset=us-ascii)
From 54112d952592b9a1d3858384b5c79b6334b20ac6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 31 Dec 2018 15:25:16 +1300
Subject: [PATCH 2/2] Refactor the fsync machinery to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting fsyncs out of md.c into a new
translation unit smgrsync.c.
2. Have smgrsync.c perform the actual fsync() calls via the existing
polymorphic smgrimmedsync() interface, extended to allow an individual
segment number to be specified.
3. Teach the checkpointer how to forget individual segments that are
unlinked from the 'front' after having been dropped from shared
buffers.
4. Move the request tracking from a bitmapset into a sorted vector,
because the proposed block storage managers are not anchored at zero
and use potentially very large and sparse integers.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/xlog.c | 2 +
src/backend/bootstrap/bootstrap.c | 1 +
src/backend/catalog/heap.c | 2 +-
src/backend/commands/dbcommands.c | 3 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/tablespace.c | 2 +-
src/backend/postmaster/bgwriter.c | 1 +
src/backend/postmaster/checkpointer.c | 22 +-
src/backend/storage/buffer/bufmgr.c | 2 +
src/backend/storage/ipc/ipci.c | 1 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 801 ++-----------------------------
src/backend/storage/smgr/smgr.c | 104 ++---
src/backend/storage/smgr/smgrsync.c | 855 ++++++++++++++++++++++++++++++++++
src/backend/tcop/utility.c | 2 +-
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 39 ++
src/include/storage/smgr.h | 29 +-
src/include/storage/smgrsync.h | 36 ++
25 files changed, 1055 insertions(+), 888 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index e43fbe0005..6fa07db4f8 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -188,7 +188,7 @@ blbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..4cf2661387 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8980,7 +8980,7 @@ heap_sync(Relation rel)
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
/* FSM is not critical, don't bother syncing it */
@@ -8991,7 +8991,7 @@ heap_sync(Relation rel)
toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
table_close(toastrel, AccessShareLock);
}
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..b29112c133 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -178,7 +178,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..a64eaa06e4 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1208,7 +1208,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (RelationNeedsWAL(wstate->index))
{
RelationOpenSmgr(wstate->index);
- smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
}
}
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index f428a15138..0eb5ced43d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -205,7 +205,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..87d1172373 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
diff --git a/src/backend/bootstrap/bootstrap.c b/src/backend/bootstrap/bootstrap.c
index 4d7ed8ad1a..083314c18a 100644
--- a/src/backend/bootstrap/bootstrap.c
+++ b/src/backend/bootstrap/bootstrap.c
@@ -33,6 +33,7 @@
#include "pg_getopt.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/startup.h"
#include "postmaster/walwriter.h"
#include "replication/walreceiver.h"
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 06d18a1cfb..9c213efbc3 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1421,7 +1421,7 @@ heap_create_init_fork(Relation rel)
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
- smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..2f2993ab4d 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,7 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
@@ -55,6 +55,7 @@
#include "storage/ipc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 715c6a221c..125b16c339 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11788,7 +11788,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
* here, they might still not be on disk when the crash occurs.
*/
if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ smgrimmedsync(dst, forkNum, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4afd178e97..ac239cfa8f 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,7 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/bgwriter.c b/src/backend/postmaster/bgwriter.c
index e6b6c549de..fd5803f195 100644
--- a/src/backend/postmaster/bgwriter.c
+++ b/src/backend/postmaster/bgwriter.c
@@ -44,6 +44,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/bufmgr.h"
#include "storage/buf_internals.h"
#include "storage/condition_variable.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..a43dc03be3 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -47,6 +47,8 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -56,6 +58,7 @@
#include "storage/proc.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -108,10 +111,10 @@
*/
typedef struct
{
- RelFileNode rnode;
+ int type;
+ RelFileNode rnode;
ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ SegmentNumber segno;
} CheckpointerRequest;
typedef struct
@@ -1077,9 +1080,7 @@ RequestCheckpoint(int flags)
* RelFileNodeBackend.
*
* segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
+ * fsync'd.
*
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
@@ -1092,13 +1093,14 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
CheckpointerRequest *request;
bool too_full;
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
@@ -1130,6 +1132,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ request->type = type;
request->rnode = rnode;
request->forknum = forknum;
request->segno = segno;
@@ -1314,7 +1317,8 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberFsyncRequest(request->type, request->rnode, request->forknum,
+ request->segno);
END_CRIT_SECTION();
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..97bdfcb7b3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -42,11 +42,13 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/buf_internals.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 5965d3620f..07b6c2f5f3 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -27,6 +27,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "replication/logicallauncher.h"
#include "replication/slot.h"
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df1..c9c4be325e 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..99470c0ebf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,24 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -134,30 +121,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +150,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +173,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -382,7 +289,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -442,7 +349,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -976,423 +883,55 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
*
* Note that only writes already issued are synced; this routine knows
* nothing of dirty buffers that may exist inside the buffer manager.
+ *
+ * See smgrimmedsync comment for contract.
*/
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+bool
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int segno;
+ MdfdVec *segments;
+ size_t num_segments;
+ size_t i;
- /*
- * NOTE: mdnblocks makes sure we have opened all active segments, so that
- * fsync loop will get them all!
- */
- mdnblocks(reln, forknum);
-
- segno = reln->md_num_open_segs[forknum];
+ if (segno != InvalidSegmentNumber)
+ {
+ /*
+ * Get the specified segment, or report failure if it doesn't seem to
+ * exist.
+ */
+ segments = _mdfd_openseg(reln, forknum, segno * RELSEG_SIZE,
+ EXTENSION_RETURN_NULL);
+ if (segments == NULL)
+ return false;
+ num_segments = 1;
+ }
+ else
+ {
+ /*
+ * NOTE: mdnblocks makes sure we have opened all active segments, so that
+ * fsync loop will get them all!
+ */
+ mdnblocks(reln, forknum);
+ num_segments = reln->md_num_open_segs[forknum];
+ segments = &reln->md_seg_fds[forknum][0];
+ }
- while (segno > 0)
+ for (i = 0; i < num_segments; ++i)
{
- MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ MdfdVec *v = &segments[i];
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
FilePathName(v->mdfd_vfd))));
- segno--;
- }
-}
-
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
}
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return true;
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
@@ -1400,16 +939,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1421,258 +952,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab3..c6d3da1c1a 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -58,10 +59,8 @@ typedef struct f_smgr
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
- void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ bool (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
} f_smgr;
@@ -81,10 +80,7 @@ static const f_smgr smgrsw[] = {
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
- .smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_immedsync = mdimmedsync
}
};
@@ -104,6 +100,14 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation.
+ */
+static inline int
+which_for_relfilenode(RelFileNode rnode)
+{
+ return 0; /* we only have md.c at present */
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +122,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -185,7 +191,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
reln->smgr_vm_nblocks = InvalidBlockNumber;
- reln->smgr_which = 0; /* we only have md.c at present */
+ reln->smgr_which = which_for_relfilenode(rnode);
/* mark it not open */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
@@ -726,17 +732,20 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* smgrimmedsync() -- Force the specified relation to stable storage.
*
* Synchronously force all previous writes to the specified relation
- * down to disk.
- *
- * This is useful for building completely new relations (eg, new
- * indexes). Instead of incrementally WAL-logging the index build
- * steps, we can just write completed index pages to disk with smgrwrite
- * or smgrextend, and then fsync the completed index file before
- * committing the transaction. (This is sufficient for purposes of
- * crash recovery, since it effectively duplicates forcing a checkpoint
- * for the completed index. But it is *not* sufficient if one wishes
- * to use the WAL log for PITR or replication purposes: in that case
- * we have to make WAL entries as well.)
+ * down to disk. If segno is not InvalidSegmentNumber, this applies
+ * only to data in the named segment file.
+ *
+ * Used for checkpointing dirty files.
+ *
+ * This can also be used for building completely new relations (eg, new
+ * indexes). Instead of incrementally WAL-logging the index build steps,
+ * we can just write completed index pages to disk with smgrwrite or
+ * smgrextend, and then fsync the completed index file before committing
+ * the transaction. (This is sufficient for purposes of crash recovery,
+ * since it effectively duplicates forcing a checkpoint for the completed
+ * index. But it is *not* sufficient if one wishes to use the WAL log
+ * for PITR or replication purposes: in that case we have to make WAL
+ * entries as well.)
*
* The preceding writes should specify skipFsync = true to avoid
* duplicative fsyncs.
@@ -744,57 +753,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* Note that you need to do FlushRelationBuffers() first if there is
* any possibility that there are dirty buffers for the relation;
* otherwise the sync is not very meaningful.
+ *
+ * Failure to fsync raises an error, but non-existence of a requested
+ * segment is reported with a false return value.
*/
-void
-smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
-{
- smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
-}
-
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
+bool
+smgrimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ return smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum, segno);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 0000000000..d343e59931
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,855 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and startup.
+ *
+ * Originally this logic was inside md.c, but it is now made more general,
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+#define SV_PREFIX segnum_vector
+#define SV_DECLARE
+#define SV_DEFINE
+#define SV_ELEMENT_TYPE BlockNumber
+#define SV_SCOPE static inline
+#define SV_GLOBAL_MEMORY_CONTEXT pendingOpsCxt
+#include "lib/simplevector.h"
+
+#define SA_PREFIX segnum_array
+#define SA_COMPARE(a,b) (*a < *b ? -1 : *a == *b ? 0 : 1)
+#define SA_DECLARE
+#define SA_DEFINE
+#define SA_ELEMENT_TYPE SV_ELEMENT_TYPE
+#define SA_SCOPE static inline
+#include "lib/sort_utils.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. A hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+
+typedef uint32 CycleCtr; /* can be any convenient integer size */
+
+/*
+ * Values for the "type" member of CheckpointerRequest.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FSYNC_SEGMENT_REQUEST 1
+#define FORGET_SEGMENT_FSYNC 2
+#define FORGET_RELATION_FSYNC 3
+#define FORGET_DATABASE_FSYNC 4
+#define UNLINK_RELATION_REQUEST 5
+#define UNLINK_SEGMENT_REQUEST 6
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ RelFileNode rnode;
+ segnum_vector requests[MAX_FORKNUM + 1];
+ segnum_vector requests_in_progress[MAX_FORKNUM + 1];
+ CycleCtr cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static bool sync_in_progress = false;
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbFsyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ ForkNumber forknum;
+
+ entry->cycle_ctr = sync_cycle_ctr;
+
+ /*
+ * If any requests remain unprocessed, they need to be merged with
+ * the segment numbers that have arrived since.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests = &entry->requests[forknum];
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ if (!segnum_vector_empty(requests_in_progress))
+ {
+ /* Append the unfinished requests that were not yet handled. */
+ segnum_vector_append_n(requests,
+ segnum_vector_data(requests_in_progress),
+ segnum_vector_size(requests_in_progress));
+ segnum_vector_reset(requests_in_progress);
+
+ /* Sort and make unique. */
+ segnum_array_sort(segnum_vector_begin(requests),
+ segnum_vector_end(requests));
+ segnum_vector_resize(requests,
+ segnum_array_unique(segnum_vector_begin(requests),
+ segnum_vector_end(requests)) -
+ segnum_vector_begin(requests));
+ }
+ }
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ ForkNumber forknum;
+ SMgrRelation reln;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync requests, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * Scan over the forks and segments represented by the entry.
+ *
+ * The vector manipulations are slightly tricky, because we can call
+ * AbsorbFsyncRequests() inside the loop and that could result in new
+ * segment numbers being added. So we swap the contents of "requests"
+ * with "requests_in_progress", and if we fail we'll merge it with any
+ * new requests that have arrived in the meantime.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ /*
+ * Transfer the current set of segment numbers into the "in
+ * progress" vector (which must be empty initially).
+ */
+ Assert(segnum_vector_empty(requests_in_progress));
+ segnum_vector_swap(&entry->requests[forknum], requests_in_progress);
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * files at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ segnum_vector_clear(requests_in_progress);
+
+ /* Loop until all requests have been handled. */
+ while (!segnum_vector_empty(requests_in_progress))
+ {
+ SegmentNumber segno = *segnum_vector_back(requests_in_progress);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ reln = smgropen(entry->rnode, InvalidBackendId);
+ if (!smgrimmedsync(reln, forknum, segno))
+ {
+ /*
+ * The underlying file couldn't be found. Check if a
+ * later message in the queue reports that it has been
+ * unlinked; if so it will be removed from the vector,
+ * indicating that we can safely skip it.
+ */
+ AbsorbFsyncRequests();
+ if (!segnum_array_binary_search(segnum_vector_begin(requests_in_progress),
+ segnum_vector_end(requests_in_progress),
+ &segno))
+ continue;
+
+ /* Otherwise it's an unexpectedly missing file. */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open backing file to fsync: %u/%u/%u",
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno)));
+ }
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ /* Remove this segment number. */
+ Assert(segno == *segnum_vector_back(requests_in_progress));
+ segnum_vector_pop_back(requests_in_progress);
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d db=%u rel=%u seg=%u time=%.3f msec",
+ processed,
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno,
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+ }
+ }
+
+ /*
+ * We've finished everything that was requested before we started to
+ * scan the entry. If no new requests have been inserted meanwhile,
+ * remove the entry. Otherwise, update its cycle counter, as all the
+ * requests now in it must have arrived during this cycle.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ Assert(segnum_vector_empty(&entry->requests_in_progress[forknum]));
+ if (!segnum_vector_empty(&entry->requests[forknum]))
+ break;
+ segnum_vector_reset(&entry->requests[forknum]);
+ }
+ if (forknum <= MAX_FORKNUM)
+ entry->cycle_ctr = sync_cycle_ctr;
+ else
+ {
+ /* Okay to remove it */
+ if (hash_search(pendingFsyncTable, &entry->rnode,
+ HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOpsTable corrupted");
+ }
+ } /* end loop over hashtable entries */
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of smgrsync */
+ sync_in_progress = false;
+}
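+
+/*
+ * Worked example of the retry merge above, with made-up segment numbers:
+ * suppose requests[f] = {1,4,7} when a sync cycle starts. The loop swaps
+ * that into requests_in_progress[f] and syncs from the back: 7, then 4.
+ * If fsync of segment 1 then fails, and requests for {2,7} are absorbed
+ * in the meantime, the next call sees requests[f] = {2,7}, appends the
+ * leftover {1}, then sorts and de-duplicates to {1,2,7}: no request is
+ * lost and the vector stays sorted for binary search.
+ */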
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * Mark a file as needing fsync.
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request to
+ * the checkpointer process.
+ *
+ * Returns true on success, but false if the queue was full and we couldn't
+ * pass the request to the checkpointer, meaning that the caller must
+ * perform the fsync.
+ */
+bool
+FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ RememberFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum, segno);
+ return true;
+ }
+ else
+ return ForwardFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum,
+ segno);
+}
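+
+/*
+ * Sketch of the expected caller pattern (compare register_dirty_segment()
+ * in md.c): if the request queue is full, the backend must do the sync
+ * itself rather than drop the request, e.g.
+ *
+ * if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, segno))
+ * smgrimmedsync(reln, forknum, segno);
+ */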
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+
+ ForwardFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+/*
+ * Find and remove a segment number by binary search.
+ */
+static inline void
+delete_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ SegmentNumber *position =
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno);
+
+ if (position != segnum_vector_end(vec) &&
+ *position == segno)
+ segnum_vector_erase(vec, position);
+}
+
+/*
+ * Add a segment number by binary search. Hopefully these tend to be added at
+ * the high end, which is cheap.
+ */
+static inline void
+insert_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ segnum_vector_insert(vec,
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno),
+ &segno);
+}
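+
+/*
+ * Example with made-up values: starting from the sorted vector {2,5,9},
+ * insert_segno(vec, 7) finds the position of 9 via lower_bound and inserts
+ * in front of it, giving {2,5,7,9}; delete_segno(vec, 5) locates 5 the
+ * same way and erases it, giving {2,7,9}. Both helpers depend on the
+ * vector staying sorted, which insert_segno preserves.
+ */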
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * Valid values for 'type':
+ * - FSYNC_SEGMENT_REQUEST means to schedule an fsync
+ * - FORGET_SEGMENT_FSYNC means to cancel pending fsyncs for one segment
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers never reach the
+ * reserved value InvalidSegmentNumber.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ Assert(pendingFsyncTable);
+
+ if (type == FORGET_SEGMENT_FSYNC || type == FORGET_RELATION_FSYNC)
+ {
+ PendingFsyncEntry *entry;
+
+ entry = hash_search(pendingFsyncTable, &rnode, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (type == FORGET_SEGMENT_FSYNC)
+ {
+ delete_segno(&entry->requests[forknum], segno);
+ delete_segno(&entry->requests_in_progress[forknum], segno);
+ }
+ else if (forknum == InvalidForkNumber)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ else
+ {
+ /* Forget about all segments for one fork. */
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+ else if (type == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+
+ /* Also remove any pending unlink requests for this database. */
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (type == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = rnode;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else if (type == FSYNC_SEGMENT_REQUEST)
+ {
+ /* Normal case: enter a request to fsync this segment */
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ &rnode,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ ForkNumber f;
+
+ entry->cycle_ctr = ckpt_cycle_ctr;
+ for (f = 0; f <= MAX_FORKNUM; f++)
+ {
+ segnum_vector_init(&entry->requests[f]);
+ segnum_vector_init(&entry->requests_in_progress[f]);
+ }
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ insert_segno(&entry->requests[forknum], segno);
+ }
+}
+
+/*
+ * ForgetSegmentFsyncRequests -- forget any fsyncs for one segment of a
+ * relation fork
+ */
+void
+ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum, segno);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum,
+ segno))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ RelFileNode rnode;
+
+ rnode.dbNode = dbid;
+ rnode.spcNode = 0;
+ rnode.relNode = 0;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ while (!ForwardFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6ec795f1b4..41760b03de 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,7 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 41d477165c..8869e730dc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3c..585ce52667 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 0000000000..28b13c2d9c
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern bool ForwardFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, BlockNumber segno);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4e..5a52229dcd 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,15 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * The type used to identify segment numbers. Generally, segments are an
+ * internal detail of individual storage manager implementations, but since
+ * they appear in various places to allow them to be passed between processes,
+ * it seemed worthwhile to have a typename.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
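+
+/*
+ * For example, md.c places a given block in segment blocknum / RELSEG_SIZE,
+ * but other storage managers may use a different mapping, including very
+ * large and sparse segment numbers; that is why pending requests are
+ * tracked in sorted vectors rather than bitmapsets.
+ */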
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -105,10 +114,9 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern bool smgrimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void AtEOXact_SMgr(void);
@@ -133,16 +141,9 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern bool mdimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
#endif /* SMGR_H */
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 0000000000..01a174e729
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,36 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "storage/smgr.h"
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern bool FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void RememberFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, SegmentNumber segno);
+extern void SetForwardFsyncRequests(void);
+extern void ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetDatabaseFsyncRequests(Oid dbid);
+
+
+#endif
--
2.16.5
0002-Refactor-the-fsync-machinery-to-support-future-SM-v8.patch
From 54112d952592b9a1d3858384b5c79b6334b20ac6 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Mon, 31 Dec 2018 15:25:16 +1300
Subject: [PATCH 2/2] Refactor the fsync machinery to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting fsyncs out of md.c into a new
translation unit smgrsync.c.
2. Have smgrsync.c perform the actual fsync() calls via the existing
polymorphic smgrimmedsync() interface, extended to allow an individual
segment number to be specified.
3. Teach the checkpointer how to forget individual segments that are
unlinked from the 'front' after having been dropped from shared
buffers.
4. Move the request tracking from a bitmapset into a sorted vector,
because the proposed block storage managers are not anchored at zero
and use potentially very large and sparse integers.
Author: Thomas Munro
Reviewed-by:
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 4 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/xlog.c | 2 +
src/backend/catalog/heap.c | 2 +-
src/backend/commands/dbcommands.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/commands/tablespace.c | 1 -
src/backend/postmaster/checkpointer.c | 21 +-
src/backend/storage/buffer/bufmgr.c | 1 +
src/backend/storage/smgr/Makefile | 2 +-
src/backend/storage/smgr/md.c | 776 ++----------------------------
src/backend/storage/smgr/smgr.c | 117 +++--
src/backend/storage/smgr/smgrsync.c | 855 ++++++++++++++++++++++++++++++++++
src/backend/tcop/utility.c | 1 -
src/backend/utils/misc/guc.c | 1 +
src/include/postmaster/bgwriter.h | 24 +-
src/include/postmaster/checkpointer.h | 39 ++
src/include/storage/smgr.h | 31 +-
src/include/storage/smgrsync.h | 35 ++
22 files changed, 1063 insertions(+), 863 deletions(-)
create mode 100644 src/backend/storage/smgr/smgrsync.c
create mode 100644 src/include/postmaster/checkpointer.h
create mode 100644 src/include/storage/smgrsync.h
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index e43fbe0005..6fa07db4f8 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -188,7 +188,7 @@ blbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..1b32b0128d 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8980,7 +8980,7 @@ heap_sync(Relation rel)
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsyncrel(rel->rd_smgr, MAIN_FORKNUM);
/* FSM is not critical, don't bother syncing it */
@@ -8991,7 +8991,7 @@ heap_sync(Relation rel)
toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsyncrel(toastrel->rd_smgr, MAIN_FORKNUM);
table_close(toastrel, AccessShareLock);
}
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..1bccc7a8df 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -178,7 +178,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsyncrel(index->rd_smgr, INIT_FORKNUM);
}
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..3b555eff89 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -34,7 +34,7 @@
* Since the index will never be used unless it is completely built,
* from a crash-recovery point of view there is no need to WAL-log the
* steps of the build. After completing the index build, we can just sync
- * the whole file to disk using smgrimmedsync() before exiting this module.
+ * the whole file to disk using smgrimmedsyncrel() before exiting this module.
* This can be seen to be sufficient for crash recovery by considering that
* it's effectively equivalent to what would happen if a CHECKPOINT occurred
* just after the index build. However, it is clearly not sufficient if the
@@ -1208,7 +1208,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (RelationNeedsWAL(wstate->index))
{
RelationOpenSmgr(wstate->index);
- smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsyncrel(wstate->index->rd_smgr, MAIN_FORKNUM);
}
}
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index f428a15138..0c5ab317cb 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -205,7 +205,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsyncrel(index->rd_smgr, INIT_FORKNUM);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..87d1172373 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -44,6 +44,7 @@
#include "pgstat.h"
#include "port/atomics.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/walwriter.h"
#include "postmaster/startup.h"
#include "replication/basebackup.h"
@@ -64,6 +65,7 @@
#include "storage/procarray.h"
#include "storage/reinit.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/builtins.h"
#include "utils/guc.h"
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index 06d18a1cfb..e0a38a1144 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1421,7 +1421,7 @@ heap_create_init_fork(Relation rel)
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
- smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ smgrimmedsyncrel(rel->rd_smgr, INIT_FORKNUM);
}
/*
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..1be6c874a1 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -47,7 +47,6 @@
#include "mb/pg_wchar.h"
#include "miscadmin.h"
#include "pgstat.h"
-#include "postmaster/bgwriter.h"
#include "replication/slot.h"
#include "storage/copydir.h"
#include "storage/fd.h"
@@ -55,6 +54,7 @@
#include "storage/ipc.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/acl.h"
#include "utils/builtins.h"
#include "utils/fmgroids.h"
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 715c6a221c..4e662bdf92 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11788,7 +11788,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
* here, they might still not be on disk when the crash occurs.
*/
if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ smgrimmedsyncrel(dst, forkNum);
}
/*
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4afd178e97..e450e161a0 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -70,7 +70,6 @@
#include "commands/tablespace.h"
#include "common/file_perm.h"
#include "miscadmin.h"
-#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/standby.h"
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..da233b1cdc 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -47,6 +47,7 @@
#include "miscadmin.h"
#include "pgstat.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/postmaster.h"
#include "replication/syncrep.h"
#include "storage/bufmgr.h"
#include "storage/condition_variable.h"
@@ -56,6 +57,7 @@
#include "storage/proc.h"
#include "storage/shmem.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/spin.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -108,10 +110,10 @@
*/
typedef struct
{
- RelFileNode rnode;
+ int type;
+ RelFileNode rnode;
ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ SegmentNumber segno;
} CheckpointerRequest;
typedef struct
@@ -1077,9 +1079,7 @@ RequestCheckpoint(int flags)
* RelFileNodeBackend.
*
* segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
+ * fsync'd.
*
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
@@ -1092,13 +1092,14 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
CheckpointerRequest *request;
bool too_full;
if (!IsUnderPostmaster)
- return false; /* probably shouldn't even get here */
+ elog(ERROR, "ForwardFsyncRequest must not be called in single user mode");
if (AmCheckpointerProcess())
elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
@@ -1130,6 +1131,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
+ request->type = type;
request->rnode = rnode;
request->forknum = forknum;
request->segno = segno;
@@ -1314,7 +1316,8 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberFsyncRequest(request->type, request->rnode, request->forknum,
+ request->segno);
END_CRIT_SECTION();
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..b75021cce3 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -47,6 +47,7 @@
#include "storage/ipc.h"
#include "storage/proc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "storage/standby.h"
#include "utils/rel.h"
#include "utils/resowner_private.h"
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df1..c9c4be325e 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
top_builddir = ../../../..
include $(top_builddir)/src/Makefile.global
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrsync.o smgrtype.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..73c68e56bf 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -30,37 +30,23 @@
#include "access/xlog.h"
#include "pgstat.h"
#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
/*
* On Windows, we have to interpret EACCES as possibly meaning the same as
* ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
* that's what you get. Ugh. This code is designed so that we don't
* actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
+ * a pending fsync request getting canceled ... see smgrsync).
*/
#ifndef WIN32
#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
@@ -134,30 +120,9 @@ static MemoryContext MdCxt; /* context for all MdfdVec objects */
* (Regular backends do not track pending operations locally, but forward
* them to the checkpointer.)
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
+typedef uint32 CycleCtr; /* can be any convenient integer size */
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -184,8 +149,7 @@ static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
bool isRedo);
static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
+ MdfdVec *seg);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
@@ -208,64 +172,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -382,7 +288,7 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
+ * relation, else the next smgrsync() will fail. There can't be any such
* requests for a temp relation, though. We can send just one request
* even when deleting multiple forks, since the fsync queuing code accepts
* the "InvalidForkNumber = all forks" convention.
@@ -442,7 +348,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ UnlinkAfterCheckpoint(rnode);
}
/*
@@ -972,13 +878,15 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
}
/*
- * mdimmedsync() -- Immediately sync a relation to stable storage.
+ * mdimmedsyncrel() -- Immediately sync a relation to stable storage.
*
* Note that only writes already issued are synced; this routine knows
* nothing of dirty buffers that may exist inside the buffer manager.
+ *
+ * See smgrimmedsyncrel comment for contract.
*/
-void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+bool
+mdimmedsyncrel(SMgrRelation reln, ForkNumber forknum)
{
int segno;
@@ -992,407 +900,57 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
while (segno > 0)
{
- MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
- ereport(data_sync_elevel(ERROR),
+ ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
FilePathName(v->mdfd_vfd))));
segno--;
}
-}
-
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
+ return true;
}
/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
+ * mdimmedsyncseg() -- Immediately sync a relation segment to stable storage.
*
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
+ * Note that only writes already issued are synced; this routine knows
+ * nothing of dirty buffers that may exist inside the buffer manager.
*
- * Remove any lingering files that can now be safely removed.
+ * See smgrimmedsyncseg comment for contract.
*/
-void
-mdpostckpt(void)
+bool
+mdimmedsyncseg(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int absorb_counter;
+ MdfdVec *v;
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
+ if (segno != InvalidSegmentNumber)
{
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
/*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
+ * Get the specified segment, or report failure if it doesn't seem to
+ * exist.
*/
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
+ v = _mdfd_openseg(reln, forknum, segno * RELSEG_SIZE,
+ EXTENSION_RETURN_NULL);
+ if (v == NULL)
+ return false;
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
+ if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ FilePathName(v->mdfd_vfd))));
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
+ return true;
}
+
+ return false;
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
- *
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
@@ -1400,16 +958,8 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!FsyncAtCheckpoint(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1421,258 +971,6 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
}
}
-/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
- */
-static void
-register_unlink(RelFileNodeBackend rnode)
-{
- /* Should never be used with temp relations */
- Assert(!RelFileNodeBackendIsTemp(rnode));
-
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
/*
* DropRelationFiles -- drop files of all given relations
*/
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab3..de93d92bb7 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/smgrsync.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -58,10 +59,9 @@ typedef struct f_smgr
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
- void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ bool (*smgr_immedsyncrel) (SMgrRelation reln, ForkNumber forknum);
+ bool (*smgr_immedsyncseg) (SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
} f_smgr;
@@ -81,10 +81,8 @@ static const f_smgr smgrsw[] = {
.smgr_writeback = mdwriteback,
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
- .smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_immedsyncrel = mdimmedsyncrel,
+ .smgr_immedsyncseg = mdimmedsyncseg
}
};
@@ -104,6 +102,14 @@ static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
+/*
+ * For now there is only one implementation.
+ */
+static inline int
+which_for_relfilenode(RelFileNode rnode)
+{
+ return 0; /* we only have md.c at present */
+}
/*
* smgrinit(), smgrshutdown() -- Initialize or shut down storage
@@ -118,6 +124,8 @@ smgrinit(void)
{
int i;
+ smgrsync_init();
+
for (i = 0; i < NSmgr; i++)
{
if (smgrsw[i].smgr_init)
@@ -185,7 +193,7 @@ smgropen(RelFileNode rnode, BackendId backend)
reln->smgr_targblock = InvalidBlockNumber;
reln->smgr_fsm_nblocks = InvalidBlockNumber;
reln->smgr_vm_nblocks = InvalidBlockNumber;
- reln->smgr_which = 0; /* we only have md.c at present */
+ reln->smgr_which = which_for_relfilenode(rnode);
/* mark it not open */
for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
@@ -723,20 +731,23 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
}
/*
- * smgrimmedsync() -- Force the specified relation to stable storage.
+ * smgrimmedsyncrel() -- Force the specified relation to stable storage.
*
* Synchronously force all previous writes to the specified relation
- * down to disk.
- *
- * This is useful for building completely new relations (eg, new
- * indexes). Instead of incrementally WAL-logging the index build
- * steps, we can just write completed index pages to disk with smgrwrite
- * or smgrextend, and then fsync the completed index file before
- * committing the transaction. (This is sufficient for purposes of
- * crash recovery, since it effectively duplicates forcing a checkpoint
- * for the completed index. But it is *not* sufficient if one wishes
- * to use the WAL log for PITR or replication purposes: in that case
- * we have to make WAL entries as well.)
+ * down to disk.
+ *
+ * Used for checkpointing dirty files.
+ *
+ * This can also be used for building completely new relations (eg, new
+ * indexes). Instead of incrementally WAL-logging the index build steps,
+ * we can just write completed index pages to disk with smgrwrite or
+ * smgrextend, and then fsync the completed index file before committing
+ * the transaction. (This is sufficient for purposes of crash recovery,
+ * since it effectively duplicates forcing a checkpoint for the completed
+ * index. But it is *not* sufficient if one wishes to use the WAL log
+ * for PITR or replication purposes: in that case we have to make WAL
+ * entries as well.)
*
* The preceding writes should specify skipFsync = true to avoid
* duplicative fsyncs.
@@ -744,57 +755,33 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* Note that you need to do FlushRelationBuffers() first if there is
* any possibility that there are dirty buffers for the relation;
* otherwise the sync is not very meaningful.
+ *
+ * Failure to fsync raises an error, but non-existence of a requested
+ * segment is reported with a false return value.
*/
-void
-smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
-{
- smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
-}
-
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
+bool
+smgrimmedsyncrel(SMgrRelation reln, ForkNumber forknum)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
+ return smgrsw[reln->smgr_which].smgr_immedsyncrel(reln, forknum);
}
/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
+ * smgrimmedsyncseg() -- Force the specified relation segment to stable storage.
+ *
+ * Synchronously force all previous writes to the specified relation
+ * segment down to disk.
+ *
+ * The preceding writes should specify skipFsync = true to avoid
+ * duplicative fsyncs.
+ *
+ * Failure to fsync raises an error, but non-existence of a requested
+ * segment is reported with a false return value.
*/
-void
-smgrpostckpt(void)
+bool
+smgrimmedsyncseg(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
+ Assert(segno != InvalidSegmentNumber);
+ return smgrsw[reln->smgr_which].smgr_immedsyncseg(reln, forknum, segno);
}
/*
diff --git a/src/backend/storage/smgr/smgrsync.c b/src/backend/storage/smgr/smgrsync.c
new file mode 100644
index 0000000000..6ac901eb78
--- /dev/null
+++ b/src/backend/storage/smgr/smgrsync.c
@@ -0,0 +1,855 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.c
+ * management of file synchronization.
+ *
+ * This module tracks which files need to be fsynced or unlinked at the
+ * next checkpoint, and performs those actions. Normally the work is done
+ * when called by the checkpointer, but it is also done in standalone mode
+ * and in the startup process.
+ *
+ * Originally this logic was inside md.c, but it has been made more general
+ * for reuse by other SMGR implementations that work with files.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/smgr/smgrsync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "nodes/pg_list.h"
+#include "pgstat.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/relfilenode.h"
+#include "storage/smgrsync.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+#define SV_PREFIX segnum_vector
+#define SV_DECLARE
+#define SV_DEFINE
+#define SV_ELEMENT_TYPE BlockNumber
+#define SV_SCOPE static inline
+#define SV_GLOBAL_MEMORY_CONTEXT pendingOpsCxt
+#include "lib/simplevector.h"
+
+#define SA_PREFIX segnum_array
+#define SA_COMPARE(a,b) (*a < *b ? -1 : *a == *b ? 0 : 1)
+#define SA_DECLARE
+#define SA_DEFINE
+#define SA_ELEMENT_TYPE SV_ELEMENT_TYPE
+#define SA_SCOPE static inline
+#include "lib/sort_utils.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. A hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+
+typedef uint32 CycleCtr; /* can be any convenient integer size */
+
+/*
+ * Values for the "type" member of CheckpointerRequest.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FSYNC_SEGMENT_REQUEST 1
+#define FORGET_SEGMENT_FSYNC 2
+#define FORGET_RELATION_FSYNC 3
+#define FORGET_DATABASE_FSYNC 4
+#define UNLINK_RELATION_REQUEST 5
+#define UNLINK_SEGMENT_REQUEST 6
+
+/* intervals for calling AbsorbFsyncRequests in smgrsync and smgrpostckpt */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * An entry in the hash table of files that need to be flushed for the next
+ * checkpoint.
+ */
+typedef struct PendingFsyncEntry
+{
+ RelFileNode rnode;
+ segnum_vector requests[MAX_FORKNUM + 1];
+ segnum_vector requests_in_progress[MAX_FORKNUM + 1];
+ CycleCtr cycle_ctr;
+} PendingFsyncEntry;
+
+typedef struct PendingUnlinkEntry
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* ckpt_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static bool sync_in_progress = false;
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr ckpt_cycle_ctr = 0;
+
+static HTAB *pendingFsyncTable = NULL;
+static List *pendingUnlinks = NIL;
+
+/*
+ * Initialize the pending operations state, if necessary.
+ */
+void
+smgrsync_init(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(RelFileNode);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingFsyncTable = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * Do pre-checkpoint work.
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrpreckpt(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ ckpt_cycle_ctr++;
+}
+
+/*
+ * Sync previous writes to stable storage.
+ */
+void
+smgrsync(void)
+{
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ int processed = CheckpointStats.ckpt_sync_rels;
+ uint64 longest = CheckpointStats.ckpt_longest_sync;
+ uint64 total_elapsed = CheckpointStats.ckpt_agg_sync_time;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingFsyncTable.
+ */
+ if (!pendingFsyncTable)
+ elog(ERROR, "cannot sync without a pendingFsyncTable");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbFsyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous smgrsync() failed to complete, run through the table and
+ * forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ ForkNumber forknum;
+
+ entry->cycle_ctr = sync_cycle_ctr;
+
+ /*
+ * If any requests remain unprocessed, they need to be merged with
+ * the segment numbers that have arrived since.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests = &entry->requests[forknum];
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ if (!segnum_vector_empty(requests_in_progress))
+ {
+ /* Append the unfinished requests that were not yet handled. */
+ segnum_vector_append_n(requests,
+ segnum_vector_data(requests_in_progress),
+ segnum_vector_size(requests_in_progress));
+ segnum_vector_reset(requests_in_progress);
+
+ /* Sort and make unique. */
+ segnum_array_sort(segnum_vector_begin(requests),
+ segnum_vector_end(requests));
+ segnum_vector_resize(requests,
+ segnum_array_unique(segnum_vector_begin(requests),
+ segnum_vector_end(requests)) -
+ segnum_vector_begin(requests));
+ }
+ }
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)))
+ {
+ ForkNumber forknum;
+ SMgrRelation reln;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync requests, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * Scan over the forks and segments represented by the entry.
+ *
+ * The vector manipulations are slightly tricky, because we can call
+ * AbsorbFsyncRequests() inside the loop and that could result in new
+ * segment numbers being added. So we swap the contents of "requests"
+ * with "requests_in_progress", and if we fail we'll merge it with any
+ * new requests that have arrived in the meantime.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector *requests_in_progress =
+ &entry->requests_in_progress[forknum];
+
+ /*
+ * Transfer the current set of segment numbers into the "in
+ * progress" vector (which must be empty initially).
+ */
+ Assert(segnum_vector_empty(requests_in_progress));
+ segnum_vector_swap(&entry->requests[forknum], requests_in_progress);
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * files at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ segnum_vector_clear(requests_in_progress);
+
+ /* Loop until all requests have been handled. */
+ while (!segnum_vector_empty(requests_in_progress))
+ {
+ SegmentNumber segno = *segnum_vector_back(requests_in_progress);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+
+ reln = smgropen(entry->rnode, InvalidBackendId);
+ Assert(segno != InvalidSegmentNumber);
+ if (!smgrimmedsyncseg(reln, forknum, segno))
+ {
+ /*
+ * The underlying file couldn't be found. Check if a
+ * later message in the queue reports that it has been
+ * unlinked; if so it will be removed from the vector,
+ * indicating that we can safely skip it.
+ */
+ AbsorbFsyncRequests();
+ if (!segnum_array_binary_search(segnum_vector_begin(requests_in_progress),
+ segnum_vector_end(requests_in_progress),
+ &segno))
+ continue;
+
+ /* Otherwise it's an unexpectedly missing file. */
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not open backing file to fsync: %u/%u/%u",
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno)));
+ }
+
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ /* Remove this segment number. */
+ Assert(segno == *segnum_vector_back(requests_in_progress));
+ segnum_vector_pop_back(requests_in_progress);
+
+ if (log_checkpoints)
+ ereport(DEBUG1,
+ (errmsg("checkpoint sync: number=%d db=%u rel=%u seg=%u time=%.3f msec",
+ processed,
+ entry->rnode.dbNode,
+ entry->rnode.relNode,
+ segno,
+ (double) elapsed / 1000),
+ errhidestmt(true),
+ errhidecontext(true)));
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+ }
+ }
+
+ /*
+ * We've finished everything that was requested before we started to
+ * scan the entry. If no new requests have been inserted meanwhile,
+ * remove the entry. Otherwise, update its cycle counter, as all the
+ * requests now in it must have arrived during this cycle.
+ */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ Assert(segnum_vector_empty(&entry->requests_in_progress[forknum]));
+ if (!segnum_vector_empty(&entry->requests[forknum]))
+ break;
+ segnum_vector_reset(&entry->requests[forknum]);
+ }
+ if (forknum <= MAX_FORKNUM)
+ entry->cycle_ctr = sync_cycle_ctr;
+ else
+ {
+ /* Okay to remove it */
+ if (hash_search(pendingFsyncTable, &entry->rnode,
+ HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingFsyncTable corrupted");
+ }
+ } /* end loop over hashtable entries */
+
+ /* Maintain sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of smgrsync */
+ sync_in_progress = false;
+}
+
+/*
+ * Do post-checkpoint work.
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+smgrpostckpt(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == ckpt_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = relpathperm(entry->rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in smgrsync, we don't want to stop absorbing fsync requests for a
+ * long time when there are many deletions to be done. We can safely
+ * call AbsorbFsyncRequests() at this point in the loop (note it might
+ * try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+
+/*
+ * Mark a file as needing fsync.
+ *
+ * If there is a local pending-ops table, just make an entry in it for
+ * smgrsync to process later. Otherwise, try to pass off the fsync request to
+ * the checkpointer process.
+ *
+ * Returns true on success, but false if the queue was full and we couldn't
+ * pass the request to the checkpointer, meaning that the caller must
+ * perform the fsync.
+ */
+bool
+FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ RememberFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum, segno);
+ return true;
+ }
+ else
+ return ForwardFsyncRequest(FSYNC_SEGMENT_REQUEST, rnode, forknum,
+ segno);
+}
+
+/*
+ * Schedule a file to be deleted after next checkpoint.
+ *
+ * As with FsyncAtCheckpoint, this could involve either a local or a remote
+ * pending-ops table.
+ */
+void
+UnlinkAfterCheckpoint(RelFileNodeBackend rnode)
+{
+ /* Should never be used with temp relations */
+ Assert(!RelFileNodeBackendIsTemp(rnode));
+
+ if (pendingFsyncTable)
+ {
+ /* push it into local pending-ops table */
+ RememberFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+ else
+ {
+ /* Notify the checkpointer about it. */
+ Assert(IsUnderPostmaster);
+
+ ForwardFsyncRequest(UNLINK_RELATION_REQUEST,
+ rnode.node,
+ MAIN_FORKNUM,
+ InvalidSegmentNumber);
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingFsyncTable during initialization of the startup
+ * process. Calling this function drops the local pendingFsyncTable so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingFsyncTable)
+ {
+ smgrsync();
+ hash_destroy(pendingFsyncTable);
+ }
+ pendingFsyncTable = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+/*
+ * Find and remove a segment number by binary search.
+ */
+static inline void
+delete_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ SegmentNumber *position =
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno);
+
+ if (position != segnum_vector_end(vec) &&
+ *position == segno)
+ segnum_vector_erase(vec, position);
+}
+
+/*
+ * Add a segment number by binary search. Hopefully these tend to be added
+ * at the high end, which is cheap.
+ */
+static inline void
+insert_segno(segnum_vector *vec, SegmentNumber segno)
+{
+ segnum_vector_insert(vec,
+ segnum_array_lower_bound(segnum_vector_begin(vec),
+ segnum_vector_end(vec),
+ &segno),
+ &segno);
+}
+
+/*
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * Valid values for 'type':
+ * - FSYNC_SEGMENT_REQUEST means to schedule an fsync
+ * - FORGET_SEGMENT_FSYNC means to cancel pending fsyncs for one segment
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers don't exceed INT_MAX.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
+ */
+void
+RememberFsyncRequest(int type, RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ Assert(pendingFsyncTable);
+
+ if (type == FORGET_SEGMENT_FSYNC || type == FORGET_RELATION_FSYNC)
+ {
+ PendingFsyncEntry *entry;
+
+ entry = hash_search(pendingFsyncTable, &rnode, HASH_FIND, NULL);
+ if (entry)
+ {
+ if (type == FORGET_SEGMENT_FSYNC)
+ {
+ delete_segno(&entry->requests[forknum], segno);
+ delete_segno(&entry->requests_in_progress[forknum], segno);
+ }
+ else if (forknum == InvalidForkNumber)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ else
+ {
+ /* Forget about all segments for one fork. */
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+ else if (type == FORGET_DATABASE_FSYNC)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingFsyncTable);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ /* Remove requests for all forks. */
+ for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+ {
+ segnum_vector_reset(&entry->requests[forknum]);
+ segnum_vector_reset(&entry->requests_in_progress[forknum]);
+ }
+ }
+ }
+
+ /* Remove unlink requests */
+ {
+ ListCell *cell,
+ *next,
+ *prev;
+
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (rnode.dbNode == entry->rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell,
+ prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ }
+ else if (type == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = rnode;
+ entry->cycle_ctr = ckpt_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else if (type == FSYNC_SEGMENT_REQUEST)
+ {
+ /* Normal case: enter a request to fsync this segment */
+ PendingFsyncEntry *entry;
+ bool found;
+
+ entry = (PendingFsyncEntry *) hash_search(pendingFsyncTable,
+ &rnode,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ ForkNumber f;
+
+ entry->cycle_ctr = ckpt_cycle_ctr;
+ for (f = 0; f <= MAX_FORKNUM; f++)
+ {
+ segnum_vector_init(&entry->requests[f]);
+ segnum_vector_init(&entry->requests_in_progress[f]);
+ }
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ insert_segno(&entry->requests[forknum], segno);
+ }
+}
+
+/*
+ * ForgetSegmentFsyncRequests -- forget any fsyncs for one segment of a
+ * relation fork
+ *
+void
+ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum, segno);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_SEGMENT_FSYNC, rnode, forknum,
+ segno))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* Notify the checkpointer about it. */
+ while (!ForwardFsyncRequest(FORGET_RELATION_FSYNC, rnode, forknum,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see smgrsync() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ RelFileNode rnode;
+
+ rnode.dbNode = dbid;
+ rnode.spcNode = 0;
+ rnode.relNode = 0;
+
+ if (pendingFsyncTable)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ while (!ForwardFsyncRequest(FORGET_DATABASE_FSYNC, rnode, 0,
+ InvalidSegmentNumber))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
diff --git a/src/backend/tcop/utility.c b/src/backend/tcop/utility.c
index 6ec795f1b4..9ed06a32e6 100644
--- a/src/backend/tcop/utility.c
+++ b/src/backend/tcop/utility.c
@@ -59,7 +59,6 @@
#include "commands/view.h"
#include "miscadmin.h"
#include "parser/parse_utilcmd.h"
-#include "postmaster/bgwriter.h"
#include "rewrite/rewriteDefine.h"
#include "rewrite/rewriteRemove.h"
#include "storage/fd.h"
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 41d477165c..8869e730dc 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -61,6 +61,7 @@
#include "postmaster/autovacuum.h"
#include "postmaster/bgworker_internals.h"
#include "postmaster/bgwriter.h"
+#include "postmaster/checkpointer.h"
#include "postmaster/postmaster.h"
#include "postmaster/syslogger.h"
#include "postmaster/walwriter.h"
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3c..585ce52667 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -1,10 +1,7 @@
/*-------------------------------------------------------------------------
*
* bgwriter.h
- * Exports from postmaster/bgwriter.c and postmaster/checkpointer.c.
- *
- * The bgwriter process used to handle checkpointing duties too. Now
- * there is a separate process, but we did not bother to split this header.
+ * Exports from postmaster/bgwriter.c.
*
* Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
*
@@ -15,29 +12,10 @@
#ifndef _BGWRITER_H
#define _BGWRITER_H
-#include "storage/block.h"
-#include "storage/relfilenode.h"
-
-
/* GUC options */
extern int BgWriterDelay;
-extern int CheckPointTimeout;
-extern int CheckPointWarning;
-extern double CheckPointCompletionTarget;
extern void BackgroundWriterMain(void) pg_attribute_noreturn();
-extern void CheckpointerMain(void) pg_attribute_noreturn();
-
-extern void RequestCheckpoint(int flags);
-extern void CheckpointWriteDelay(int flags, double progress);
-
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
-
-extern Size CheckpointerShmemSize(void);
-extern void CheckpointerShmemInit(void);
-extern bool FirstCallSinceLastCheckpoint(void);
#endif /* _BGWRITER_H */
diff --git a/src/include/postmaster/checkpointer.h b/src/include/postmaster/checkpointer.h
new file mode 100644
index 0000000000..28b13c2d9c
--- /dev/null
+++ b/src/include/postmaster/checkpointer.h
@@ -0,0 +1,39 @@
+/*-------------------------------------------------------------------------
+ *
+ * checkpointer.h
+ * Exports from postmaster/checkpointer.c.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ *
+ * src/include/postmaster/checkpointer.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef CHECKPOINTER_H
+#define CHECKPOINTER_H
+
+#include "common/relpath.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+
+/* GUC options */
+extern int CheckPointTimeout;
+extern int CheckPointWarning;
+extern double CheckPointCompletionTarget;
+
+extern void CheckpointerMain(void) pg_attribute_noreturn();
+extern bool ForwardFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, BlockNumber segno);
+extern void RequestCheckpoint(int flags);
+extern void CheckpointWriteDelay(int flags, double progress);
+
+extern void AbsorbFsyncRequests(void);
+extern void AbsorbAllFsyncRequests(void);
+
+extern Size CheckpointerShmemSize(void);
+extern void CheckpointerShmemInit(void);
+
+extern bool FirstCallSinceLastCheckpoint(void);
+extern void CountBackendWrite(void);
+
+#endif
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4e..8b7ff665c5 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,15 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * The type used to identify segment numbers. Generally, segments are an
+ * internal detail of individual storage manager implementations, but since
+ * they appear in various places to allow them to be passed between processes,
+ * it seemed worthwhile to have a typename.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -105,10 +114,10 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
+extern bool smgrimmedsyncrel(SMgrRelation reln, ForkNumber forknum);
+extern bool smgrimmedsyncseg(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void AtEOXact_SMgr(void);
@@ -133,16 +142,10 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern bool mdimmedsyncrel(SMgrRelation reln, ForkNumber forknum);
+extern bool mdimmedsyncseg(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
#endif /* SMGR_H */
diff --git a/src/include/storage/smgrsync.h b/src/include/storage/smgrsync.h
new file mode 100644
index 0000000000..212a0f8443
--- /dev/null
+++ b/src/include/storage/smgrsync.h
@@ -0,0 +1,35 @@
+/*-------------------------------------------------------------------------
+ *
+ * smgrsync.h
+ * management of file synchronization
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/smgrsync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SMGRSYNC_H
+#define SMGRSYNC_H
+
+#include "storage/smgr.h"
+
+extern void smgrsync_init(void);
+extern void smgrpreckpt(void);
+extern void smgrsync(void);
+extern void smgrpostckpt(void);
+
+extern void UnlinkAfterCheckpoint(RelFileNodeBackend rnode);
+extern bool FsyncAtCheckpoint(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void RememberFsyncRequest(int type, RelFileNode rnode,
+ ForkNumber forknum, SegmentNumber segno);
+extern void SetForwardFsyncRequests(void);
+extern void ForgetSegmentFsyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetDatabaseFsyncRequests(Oid dbid);
+
+#endif
--
2.16.5
On Wed, Feb 13, 2019 at 3:58 PM Shawn Debnath <sdn@amazon.com> wrote:
On Wed, Jan 30, 2019 at 09:59:38PM -0800, Shawn Debnath wrote:
I wonder if it might be better to introduce two different functions
catering to the two different use cases for forcing an immediate sync:

- sync a relation
  smgrimmedsyncrel(SMgrRelation, ForkNumber)
- sync a specific segment
  smgrimmedsyncseg(SMgrRelation, ForkNumber, SegmentNumber)

This will avoid having to specify InvalidSegmentNumber for majority of
the callers today.

I have gone ahead and rebased the refactor patch so it could cleanly
apply on heapam.c, see patch v7.

I am also attaching a patch (v8) that implements smgrimmedsyncrel() and
smgrimmedsyncseg() as I mentioned in the previous email. It avoids
callers having to pass in InvalidSegmentNumber when they just want to
sync the whole relation. As a side effect, I was able to get rid of some
extra checkpointer.h includes.
Hi Shawn,
Thanks! And sorry for not replying sooner -- I got distracted by
FOSDEM (and the associated 20 thousand miles of travel). On that trip
I had a chance to discuss this patch with Andres Freund in person, and
he opined that it might be better for the fsync request queue to work
in terms of pathnames. Instead of the approach in this patch, where a
backend sends an fsync request for { reflfilenode, segno } inside
mdwrite(), and then the checkpointer processes the request by calling
smgrimmedsyncrel(), he speculated that it'd be better to have
mdwrite() send an fsync request for a pathname, and then the
checkpointer would just open that file by name and fsync() it. That
is, the checkpointer wouldn't call back into smgr.
One of the advantages of that approach is that there are probably
other files that need to be fsync'd for each checkpoint that could
benefit from being offloaded to the checkpointer. Another is that you
break the strange cycle mentioned above.
Here's a possible problem with it. The fsync request messages would
have to be either large (MAXPGPATH) or variable sized and potentially
large. I am a bit worried that such messages would be problematic for
the atomicity requirement of the (future) fd-passing patch that passes
it via a Unix domain socket (which is a bit tricky because
SOCK_SEQPACKET and SOCK_DGRAM aren't portable enough, so we probably
have to use SOCK_STREAM, but there is no formal guarantee like
PIPE_BUF; we know that in practice small messages will be atomic, but
certainty decreases with larger messages. This needs more study...).
You need to include the path even in a message containing an fd,
because the checkpointer will use that as a hashtable key to merge
received requests. Perhaps you'd solve that by using a small tag that
can be converted back to a path (as you noticed my last patch had some
leftover dead code from an experiment along those lines), but then I
think you'd finish up needing an smgr interface to convert it back to
the path (implementation different for md.c, undofile.c, slru.c). So
you don't exactly break the cycle mentioned earlier. Hmm, or perhaps
you could avoid even thinking about atomicity by passing 1 byte
fd-bearing messages via the pipe, and pathnames via shared memory, in
the same order.
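
To make that concrete, here's a minimal sketch of a 1-byte fd-bearing
message in plain POSIX terms, assuming SOCK_STREAM and SCM_RIGHTS
ancillary data; send_fd/receive_fd and the payload-byte protocol are
made-up names for illustration, not code from the attached patch:

#include <string.h>
#include <sys/socket.h>

/* Send one byte of payload and one file descriptor. */
static int
send_fd(int sock, unsigned char payload, int fd_to_send)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union
    {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;       /* ensure cmsg alignment */
    } u;

    memset(&msg, 0, sizeof(msg));
    memset(&u, 0, sizeof(u));
    iov.iov_base = &payload;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd_to_send, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one byte of payload plus the transferred descriptor. */
static int
receive_fd(int sock, unsigned char *payload, int *received_fd)
{
    struct msghdr msg;
    struct iovec iov;
    struct cmsghdr *cmsg;
    union
    {
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;

    memset(&msg, 0, sizeof(msg));
    iov.iov_base = payload;
    iov.iov_len = 1;
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg == NULL ||
        cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS)
        return -1;
    memcpy(received_fd, CMSG_DATA(cmsg), sizeof(int));
    return 0;
}

Since the payload is a single byte, each sendmsg() is trivially atomic
even on SOCK_STREAM, and the receiver can pair each byte with a pathname
read from shared memory in the same order.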
Another consideration if we do that is that the existing scheme has a
kind of hierarchy that allows fsync requests to be cancelled in bulk
when you drop relations and databases. That is, the checkpointer
knows about the internal hierarchy of tablespace, db, rel, seg. If we
get rid of that and have just paths, it seems like a bad idea to teach
the checkpointer about the internal structure of the paths (even
though we know they contain the same elements encoded somehow). You'd
have to send an explicit cancel for every key; that is, if you're
dropping a relation, you need to generate a cancel message for every
segment, and if you're dropping a database, you need to generate a
cancel message for every segment of every relation. Once again, if
you used some kind of tag that is passed back to smgr, you could
probably come up with a way to do it.
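
For example, the checkpointer side could stay completely ignorant of
path structure with something along these lines (a sketch only; every
name here is hypothetical, and the real interface would need more
thought):

/* A small fixed-size tag; hypothetical layout. */
typedef struct FileTag
{
    int16 smgr_id;              /* which smgr implementation owns this */
    RelFileNode rnode;
    ForkNumber forknum;
    SegmentNumber segno;
} FileTag;

/* Hypothetical per-smgr callback: is "tag" covered by "cancel_tag"? */
typedef bool (*file_tag_matches_fn) (const FileTag *tag,
                                     const FileTag *cancel_tag);

/*
 * Checkpointer side: one cancel message can cover a whole relation or
 * database, because the owning smgr interprets the tags, not the
 * checkpointer.
 */
static void
cancel_matching_requests(HTAB *pending, const FileTag *cancel_tag,
                         file_tag_matches_fn matches)
{
    HASH_SEQ_STATUS hstat;
    FileTag *tag;

    hash_seq_init(&hstat, pending);
    while ((tag = (FileTag *) hash_seq_search(&hstat)) != NULL)
    {
        if (matches(tag, cancel_tag) &&
            hash_search(pending, tag, HASH_REMOVE, NULL) == NULL)
            elog(ERROR, "pending request table corrupted");
    }
}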
Thoughts?
--
Thomas Munro
http://www.enterprisedb.com
Hi,
On 2019-02-13 18:40:05 +1300, Thomas Munro wrote:
Thanks! And sorry for not replying sooner -- I got distracted by
FOSDEM (and the associated 20 thousand miles of travel). On that trip
I had a chance to discuss this patch with Andres Freund in person, and
he opined that it might be better for the fsync request queue to work
in terms of pathnames. Instead of the approach in this patch, where a
backend sends an fsync request for { reflfilenode, segno } inside
mdwrite(), and then the checkpointer processes the request by calling
smgrimmedsyncrel(), he speculated that it'd be better to have
mdwrite() send an fsync request for a pathname, and then the
checkpointer would just open that file by name and fsync() it. That
is, the checkpointer wouldn't call back into smgr.

One of the advantages of that approach is that there are probably
other files that need to be fsync'd for each checkpoint that could
benefit from being offloaded to the checkpointer. Another is that you
break the strange cycle mentioned above.
The other issue is that I think your approach moves the segmentation
logic basically out of md into smgr. I think that's wrong. We shouldn't
presume that every type of storage is going to have segmentation that's
representable in a uniform way imo.
Another consideration if we do that is that the existing scheme has a
kind of hierarchy that allows fsync requests to be cancelled in bulk
when you drop relations and databases. That is, the checkpointer
knows about the internal hierarchy of tablespace, db, rel, seg. If we
get rid of that and have just paths, it seems like a bad idea to teach
the checkpointer about the internal structure of the paths (even
though we know they contain the same elements encoded somehow). You'd
have to send an explicit cancel for every key; that is, if you're
dropping a relation, you need to generate a cancel message for every
segment, and if you're dropping a database, you need to generate a
cancel message for every segment of every relation.
I can't see that being a problem - compared to the overhead of dropping
a relation, that doesn't seem to be a meaningfully large cost?
Greetings,
Andres Freund
On Fri, Feb 15, 2019 at 06:45:02PM -0800, Andres Freund wrote:
One of the advantages of that approach is that there are probably
other files that need to be fsync'd for each checkpoint that could
benefit from being offloaded to the checkpointer. Another is that you
break the strange cycle mentioned above.

The other issue is that I think your approach moves the segmentation
logic basically out of md into smgr. I think that's wrong. We shouldn't
presume that every type of storage is going to have segmentation that's
representable in a uniform way imo.
I had a discussion with Thomas on this and am working on a new version
of the patch that incorporates what you guys discussed at FOSDEM, but
avoiding passing pathnames to checkpointer.
The mdsync machinery will be moved out of md.c and pending ops table
will incorporate the segment number as part of the key. Still deciding
on how to cleanly re-factor _mdfd_getseg which mdsync utilizes during
the file sync operations. The ultimate goal is to get checkpointer the
file descriptor it can use to issue the fsync using FileSync. So perhaps
a function in smgr that returns just that based on the RelFileNode, fork
and segno combination. Dealing only with file descriptors will allow us
to implement passing FDs to checkpointer directly as part of the request
in the future.
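
To sketch what I have in mind (names provisional, for discussion only):

/* Hash key for the pending ops table, with segno now part of the key. */
typedef struct PendingFsyncKey
{
    RelFileNode rnode;
    ForkNumber forknum;
    SegmentNumber segno;
} PendingFsyncKey;

/*
 * Provisional smgr entry point: resolve the key components to a vfd the
 * checkpointer can hand straight to FileSync().
 */
extern File smgrfilehandle(RelFileNode rnode, ForkNumber forknum,
                           SegmentNumber segno);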
The goal is to encapsulate relation specific knowledge within md.c while
allowing undo and generic block store (ex-SLRU) to do their own mapping
within the smgr layer later. Yes, checkpointer will "call back" into
smgr, but these calls would be to retrieve information that should be
managed by smgr. This allows the checkpointer to focus on its job of
tracking requests and syncing files via the fd interfaces.
Another consideration if we do that is that the existing scheme has a
kind of hierarchy that allows fsync requests to be cancelled in bulk
when you drop relations and databases. That is, the checkpointer
knows about the internal hierarchy of tablespace, db, rel, seg. If we
get rid of that and have just paths, it seems like a bad idea to teach
the checkpointer about the internal structure of the paths (even
though we know they contain the same elements encoded somehow). You'd
have to send an explicit cancel for every key; that is, if you're
dropping a relation, you need to generate a cancel message for every
segment, and if you're dropping a database, you need to generate a
cancel message for every segment of every relation.

I can't see that being a problem - compared to the overhead of dropping
a relation, that doesn't seem to be a meaningfully large cost?
With the scheme above, dropping hierarchies will require scanning the
hash table for entries matching the dboid or reloid and removing them.
We already do this today for FORGET_DATABASE_FSYNC in
RememberFsyncRequest. The matching function will belong in smgr. We
can measure how scanning the whole hash table impacts performance and
iterate from there if needed.
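As a sketch (using the pendingOps table and PendingFsyncEntry type
from the patch posted below, so this would sit inside smgr.c), the
database case could look like:

    /*
     * Sketch: cancel every pending fsync for a dropped database by
     * scanning the whole pending-ops hash table, much as
     * RememberFsyncRequest handles FORGET_DATABASE_FSYNC today.
     * Entries are marked canceled rather than removed, because the
     * sync loop may have an active hash_seq_search scan.
     */
    static void
    forget_database_fsyncs(Oid dbid)
    {
        HASH_SEQ_STATUS hstat;
        PendingFsyncEntry *entry;

        hash_seq_init(&hstat, pendingOps);
        while ((entry = (PendingFsyncEntry *)
                hash_seq_search(&hstat)) != NULL)
        {
            if (entry->tag.rnode.dbNode == dbid)
                entry->canceled = true;
        }
    }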
--
Shawn Debnath
Amazon Web Services (AWS)
As promised, here's a patch that addresses the points discussed by
Andres and Thomas at FOSDEM. As a result of how we want the
checkpointer to track which files to fsync, the pending ops table now
integrates the forknum and segno as part of the hash key, eliminating
the need for the bitmapsets or vectors from previous iterations. We
reconstruct the pathnames from the RelFileNode, ForkNumber and
SegmentNumber, and use PathNameOpenFile to get the file descriptor to
use for fsync.
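In condensed form (FsyncTag is the hash key struct from the attached
patch; the helper below is my paraphrase of the per-entry work in
ProcessFsyncRequests, with the retry and cancel handling omitted):

    #include "postgres.h"

    #include <fcntl.h>

    #include "pgstat.h"
    #include "storage/backendid.h"
    #include "storage/fd.h"
    #include "storage/smgr.h"

    /* One hash entry per (relfilenode, fork, segment); no bitmapsets. */
    typedef struct FsyncTag
    {
        RelFileNode rnode;
        ForkNumber  forknum;
        SegmentNumber segno;
    } FsyncTag;

    /* Rebuild the segment's path, reopen it, and sync it. */
    static void
    sync_one_entry(const FsyncTag *tag)
    {
        RelFileNodeBackend rnode = {.node = tag->rnode,
                                    .backend = InvalidBackendId};
        char       *path = mdsegpath(rnode, tag->forknum, tag->segno);
        File        file = PathNameOpenFile(path, O_RDWR | PG_BINARY);

        if (file < 0 ||
            FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
            ereport(data_sync_elevel(ERROR),
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", path)));

        FileClose(file);
        pfree(path);
    }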
Apart from that, this patch moves the system for requesting and
processing fsyncs out of md.c into smgr.c, allowing us to call smgr
component-specific callbacks to retrieve metadata like relation and
segment paths. This lets smgr components maintain how relfilenodes,
forks and segments map to specific files without exposing this
knowledge to smgr. It redefines smgrsync() behavior to be closer to
that of smgrimmedsync(), i.e., if a regular sync is required for a
particular file, enqueue it locally or forward it to the checkpointer.
smgrimmedsync() retains the existing behavior and fsyncs the file
right away. The processing of fsync requests has been moved from
mdsync() to a new ProcessFsyncRequests() function.
Testing
-------
Checkpointer stats didn't cover what I wanted to verify, i.e., the
time spent dealing with the pending operations table. So I added
temporary instrumentation to get the numbers by timing the code in
ProcessFsyncRequests, which starts by absorbing fsync requests from
the checkpointer queue, processes them, and finally issues syncs on
the files. I added the same instrumentation to the mdsync code on the
master branch. The time to actually execute FileSync is irrelevant for
this patch.
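The instrumentation is just the usual instr_time pattern around the
processing call, roughly like this (a sketch, not the attached
instrumentation patches verbatim):

    #include "postgres.h"

    #include "portability/instr_time.h"
    #include "storage/smgr.h"

    /* Sketch of a throwaway timing wrapper used to collect the numbers. */
    static void
    timed_process_fsync_requests(void)
    {
        instr_time  start,
                    duration;

        INSTR_TIME_SET_CURRENT(start);
        ProcessFsyncRequests();     /* mdsync() on the master branch */
        INSTR_TIME_SET_CURRENT(duration);
        INSTR_TIME_SUBTRACT(duration, start);

        elog(LOG, "fsync request processing: %.3f ms",
             INSTR_TIME_GET_MILLISEC(duration));
    }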
I did two separate runs for 30 minutes, both with scale=10,000 on
i3.8xlarge instances [1] with default params to force frequent
checkpoints:
1. A single pgbench run with 1000 clients updating 4 tables; as a
result we get 4 relations, their forks, and several segments in each
being synced.
2. 10 parallel pgbench runs on 10 separate databases with 200 clients
each. This touches more relations and more segments, letting us better
compare against the bitmapset optimizations.
Results
--------
The important metric is the total time spent absorbing and processing
the fsync requests, as that's what the changes revolve around. The
other metrics are here for posterity. The new code is about 6% faster
in total time taken to process the queue for the single pgbench run.
For the 10x parallel pgbench run, we are seeing drops of up to 70%
with the patch.
It would be great if some other folks could verify this. The temporary
instrumentation patches, one for the master branch and one that applies
on top of the main patch, are attached. Enable log_checkpoints, then
use grep and cut to extract the numbers from the log file after the
runs.
[Requests Absorbed]
single pgbench run
Min Max Average Median Mode Std Dev
-------- ------- -------- ---------- -------- ------- ----------
patch 15144 144961 78628.84 76124 58619 24135.69
master 25728 138422 81455.04 80601 25728 21295.83
10 parallel pgbench runs
Min Max Average Median Mode Std Dev
-------- -------- -------- ----------- -------- -------- ----------
patch 45098 282158 155969.4 151603 153049 39990.91
master 191833 602512 416533.86 424946 191833 82014.48
[Files Synced]
single pgbench run
Min Max Average Median Mode Std Dev
-------- ----- ----- --------- -------- ------ ---------
patch 153 166 158.11 158 159 1.86
master 154 166 158.29 159 159 10.29
10 parallel pgbench runs
Min Max Average Median Mode Std Dev
-------- ------ ------ --------- -------- ------ ---------
patch 1540 1662 1556.42 1554 1552 11.12
master 1546 1546 1546 1559 1553 12.79
[Total Time in ProcessFsyncRequests/mdsync]
single pgbench run
Min Max Average Median Mode Std Dev
-------- ----- --------- --------- -------- ------ ---------
patch 500 3833.51 2305.22 2239 500 510.08
master 806 4430.32 2458.77 2382 806 497.01
10 parallel pgbench runs
Min Max Average Median Mode Std Dev
-------- ------ ------- ---------- -------- ------ ---------
patch 908 6927 3022.58 2863 908 939.09
master 4323 17858 10982.15 11154 4322 2760.47
[1]: https://aws.amazon.com/ec2/instance-types/i3/
--
Shawn Debnath
Amazon Web Services (AWS)
Attachments:
0001-Refactor-the-fsync-machinery-to-support-future-SMGR-v9.patch (text/plain; charset=us-ascii)
From 88cb3577b380275019642458ae40ae1c93c78755 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Sun, 17 Feb 2019 22:08:46 +0000
Subject: [PATCH] Refactor the fsync mechanism to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting and processing fsyncs out of md.c
into smgr.c.
2. Redefine smgrsync() behavior to be closer to that of
smgrimmedsync(), i.e., if a regular sync is required for a particular
file, enqueue it locally or forward it to checkpointer. The
processing of fsync requests has been moved to a new ProcessFsyncRequests
function. smgrimmedsync() retains the old behavior of forcing an
immediate sync.
3. Removed the need for specific storage managers to implement pre and
post checkpoint callbacks. These are now executed at the smgr layer.
4. We now embed the fork number and the segment number as part of the
hash key for the pending ops table. This eliminates the bitmapset based
segment tracking for each relfilenode during fsync as not all storage
managers may map their segments from zero.
5. As part of processing the requests, we now re-construct the
path to the segment based on relfilenode, fork and segment numbers, and
use PathNameOpenFile to get a file descriptor to use for FileSync.
Author: Shawn Debnath, Thomas Munro
Reviewed-by:
Discussion:
https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
contrib/bloom/blinsert.c | 2 +-
src/backend/access/heap/heapam.c | 4 +-
src/backend/access/nbtree/nbtree.c | 2 +-
src/backend/access/nbtree/nbtsort.c | 2 +-
src/backend/access/spgist/spginsert.c | 2 +-
src/backend/access/transam/xlog.c | 4 +-
src/backend/catalog/heap.c | 2 +-
src/backend/commands/tablecmds.c | 2 +-
src/backend/postmaster/checkpointer.c | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 915 +++-------------------------------
src/backend/storage/smgr/smgr.c | 702 ++++++++++++++++++++++++--
src/include/postmaster/bgwriter.h | 3 +-
src/include/storage/smgr.h | 68 ++-
14 files changed, 818 insertions(+), 894 deletions(-)
diff --git a/contrib/bloom/blinsert.c b/contrib/bloom/blinsert.c
index e43fbe0005..6fa07db4f8 100644
--- a/contrib/bloom/blinsert.c
+++ b/contrib/bloom/blinsert.c
@@ -188,7 +188,7 @@ blbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index dc3499349b..4cf2661387 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -8980,7 +8980,7 @@ heap_sync(Relation rel)
/* main heap */
FlushRelationBuffers(rel);
/* FlushRelationBuffers will have opened rd_smgr */
- smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
/* FSM is not critical, don't bother syncing it */
@@ -8991,7 +8991,7 @@ heap_sync(Relation rel)
toastrel = table_open(rel->rd_rel->reltoastrelid, AccessShareLock);
FlushRelationBuffers(toastrel);
- smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(toastrel->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
table_close(toastrel, AccessShareLock);
}
}
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 98917de2ef..b29112c133 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -178,7 +178,7 @@ btbuildempty(Relation index)
* write did not go through shared_buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/nbtree/nbtsort.c b/src/backend/access/nbtree/nbtsort.c
index dc398e1186..052215bb34 100644
--- a/src/backend/access/nbtree/nbtsort.c
+++ b/src/backend/access/nbtree/nbtsort.c
@@ -1208,7 +1208,7 @@ _bt_load(BTWriteState *wstate, BTSpool *btspool, BTSpool *btspool2)
if (RelationNeedsWAL(wstate->index))
{
RelationOpenSmgr(wstate->index);
- smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM);
+ smgrimmedsync(wstate->index->rd_smgr, MAIN_FORKNUM, InvalidSegmentNumber);
}
}
diff --git a/src/backend/access/spgist/spginsert.c b/src/backend/access/spgist/spginsert.c
index f428a15138..0eb5ced43d 100644
--- a/src/backend/access/spgist/spginsert.c
+++ b/src/backend/access/spgist/spginsert.c
@@ -205,7 +205,7 @@ spgbuildempty(Relation index)
* writes did not go through shared buffers and therefore a concurrent
* checkpoint may have moved the redo pointer past our xlog record.
*/
- smgrimmedsync(index->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(index->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..69eaa2f7e2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8616,7 +8616,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ smgrprecheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8912,7 +8912,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ smgrpostcheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/catalog/heap.c b/src/backend/catalog/heap.c
index d0215a5eed..8ecf7c09a2 100644
--- a/src/backend/catalog/heap.c
+++ b/src/backend/catalog/heap.c
@@ -1421,7 +1421,7 @@ heap_create_init_fork(Relation rel)
RelationOpenSmgr(rel);
smgrcreate(rel->rd_smgr, INIT_FORKNUM, false);
log_smgrcreate(&rel->rd_smgr->smgr_rnode.node, INIT_FORKNUM);
- smgrimmedsync(rel->rd_smgr, INIT_FORKNUM);
+ smgrimmedsync(rel->rd_smgr, INIT_FORKNUM, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/commands/tablecmds.c b/src/backend/commands/tablecmds.c
index 715c6a221c..125b16c339 100644
--- a/src/backend/commands/tablecmds.c
+++ b/src/backend/commands/tablecmds.c
@@ -11788,7 +11788,7 @@ copy_relation_data(SMgrRelation src, SMgrRelation dst,
* here, they might still not be on disk when the crash occurs.
*/
if (relpersistence == RELPERSISTENCE_PERMANENT || copying_initfork)
- smgrimmedsync(dst, forkNum);
+ smgrimmedsync(dst, forkNum, InvalidSegmentNumber);
}
/*
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..867f427028 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1092,7 +1092,7 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
CheckpointerRequest *request;
bool too_full;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..c493c591aa 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessFsyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..46b54f92a3 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,8 +29,6 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
-#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
@@ -39,35 +37,6 @@
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,53 +83,27 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink(RelFileNodeBackend rnode);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+
/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
+ * Segment handling behaviors
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
-
-
-/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL (1 << 0)
/* return NULL if segment not present */
@@ -179,26 +122,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +131,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -382,10 +247,10 @@ mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
/*
* We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
+ * relation, else the next ProcessFsyncRequests will fail. There can't be
+ * any such requests for a temp relation, though. We can send just one
+ * request even when deleting multiple forks, since the fsync queuing
+ * code accepts the "InvalidForkNumber = all forks" convention.
*/
if (!RelFileNodeBackendIsTemp(rnode))
ForgetRelationFsyncRequests(rnode.node, forkNum);
@@ -978,421 +843,91 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* nothing of dirty buffers that may exist inside the buffer manager.
*/
void
-mdimmedsync(SMgrRelation reln, ForkNumber forknum)
+mdimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
- int segno;
-
- /*
- * NOTE: mdnblocks makes sure we have opened all active segments, so that
- * fsync loop will get them all!
- */
- mdnblocks(reln, forknum);
+ MdfdVec *segments = NULL;
+ size_t nsegs = 0;
- segno = reln->md_num_open_segs[forknum];
+ if (segno != InvalidSegmentNumber)
+ {
+ /* Get the specified segment, or report failure if it doesn't exist */
+ segments = _mdfd_openseg(reln, forknum, segno * RELSEG_SIZE,
+ EXTENSION_RETURN_NULL);
+ if (!segments)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\" to fsync: %m",
+ mdsegpath(reln->smgr_rnode, forknum, segno))));
+ nsegs = 1;
+ }
+ else
+ {
+ /*
+ * NOTE: mdnblocks makes sure we have opened all active segments, so that
+ * fsync loop will get them all!
+ */
+ mdnblocks(reln, forknum);
+ nsegs = reln->md_num_open_segs[forknum];
+ segments = &reln->md_seg_fds[forknum][0];
+ }
- while (segno > 0)
+ for (segno = 0; segno < nsegs; ++segno)
{
- MdfdVec *v = &reln->md_seg_fds[forknum][segno - 1];
+ MdfdVec *v = &segments[segno];
if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_IMMEDIATE_SYNC) < 0)
ereport(data_sync_elevel(ERROR),
(errcode_for_file_access(),
errmsg("could not fsync file \"%s\": %m",
FilePathName(v->mdfd_vfd))));
- segno--;
}
}
/*
- * mdsync() -- Sync previous writes to stable storage.
+ * mdrelpath() -- Return the full path to the relation.
*/
-void
-mdsync(void)
+char *
+mdrelpath(RelFileNodeBackend rnode, ForkNumber forknum)
{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+ char *path;
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
+ path = relpath(rnode, forknum);
+ return path;
}
/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
+ * mdsegpath()
*
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
*/
-void
-mdpreckpt(void)
+char *
+mdsegpath(RelFileNodeBackend rnode, ForkNumber forknum, BlockNumber segno)
{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
+ char *path,
+ *fullpath;
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
+ path = relpath(rnode, forknum);
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
+ if (segno > 0)
{
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
+ fullpath = psprintf("%s.%u", path, segno);
pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
}
+ else
+ fullpath = path;
+
+ return fullpath;
}
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
- * If there is a local pending-ops table, just make an entry in it for
- * mdsync to process later. Otherwise, try to pass off the fsync request
- * to the checkpointer process. If that fails, just do the fsync
- * locally before returning (we hope this will not happen often enough
- * to be a performance problem).
+ * Call smgrsync() to queue the fsync request. If there is a local pending-ops
+ * table, just make an entry in it to be processed later. Otherwise, try to
+ * forward the fsync request to the checkpointer process. If that fails, just
+ * do the fsync locally before returning (we hope this will not happen often
+ * enough to be a performance problem).
*/
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
@@ -1400,16 +935,9 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
+ /* Try to push it into local pending-ops table, or forward to checkpointer */
+ if (!smgrsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
{
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1436,13 +964,8 @@ register_unlink(RelFileNodeBackend rnode)
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
+ /* Try to push it into local pending-ops table, or forward to checkpointer */
+ if (!smgrsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST))
{
/*
* Notify the checkpointer about it. If we fail to queue the request
@@ -1458,259 +981,6 @@ register_unlink(RelFileNodeBackend rnode)
}
}
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
-}
-
-/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
- */
-void
-ForgetDatabaseFsyncRequests(Oid dbid)
-{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * DropRelationFiles -- drop files of all given relations
- */
-void
-DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo)
-{
- SMgrRelation *srels;
- int i;
-
- srels = palloc(sizeof(SMgrRelation) * ndelrels);
- for (i = 0; i < ndelrels; i++)
- {
- SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
-
- if (isRedo)
- {
- ForkNumber fork;
-
- for (fork = 0; fork <= MAX_FORKNUM; fork++)
- XLogDropRelation(delrels[i], fork);
- }
- srels[i] = srel;
- }
-
- smgrdounlinkall(srels, ndelrels, isRedo);
-
- /*
- * Call smgrclose() in reverse order as when smgropen() is called.
- * This trick enables remove_from_unowned_list() in smgrclose()
- * to search the SMgrRelation from the unowned list,
- * with O(1) performance.
- */
- for (i = ndelrels - 1; i >= 0; i--)
- smgrclose(srels[i]);
- pfree(srels);
-}
-
-
/*
* _fdvec_resize() -- Resize the fork's open segments array
*/
@@ -1748,29 +1018,6 @@ _fdvec_resize(SMgrRelation reln,
reln->md_num_open_segs[forknum] = nseg;
}
-/*
- * Return the filename for the specified segment of the relation. The
- * returned string is palloc'd.
- */
-static char *
-_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
-{
- char *path,
- *fullpath;
-
- path = relpath(reln->smgr_rnode, forknum);
-
- if (segno > 0)
- {
- fullpath = psprintf("%s.%u", path, segno);
- pfree(path);
- }
- else
- fullpath = path;
-
- return fullpath;
-}
-
/*
* Open the specified segment of the relation,
* and make a MdfdVec object for it. Returns NULL on failure.
@@ -1783,7 +1030,7 @@ _mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
int fd;
char *fullpath;
- fullpath = _mdfd_segpath(reln, forknum, segno);
+ fullpath = mdsegpath(reln->smgr_rnode, forknum, segno);
/* open the file */
fd = PathNameOpenFile(fullpath, O_RDWR | PG_BINARY | oflags);
@@ -1918,7 +1165,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not open file \"%s\" (target block %u): previous segment is only %u blocks",
- _mdfd_segpath(reln, forknum, nextsegno),
+ mdsegpath(reln->smgr_rnode, forknum, nextsegno),
blkno, nblocks)));
}
@@ -1932,7 +1179,7 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
ereport(ERROR,
(errcode_for_file_access(),
errmsg("could not open file \"%s\" (target block %u): %m",
- _mdfd_segpath(reln, forknum, nextsegno),
+ mdsegpath(reln->smgr_rnode, forknum, nextsegno),
blkno)));
}
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab3..e387c01776 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -17,13 +17,77 @@
*/
#include "postgres.h"
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
+#include "utils/memutils.h"
#include "utils/inval.h"
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct FsyncTag
+{
+ RelFileNode rnode; /* rel file node entry */
+ ForkNumber forknum;
+ SegmentNumber segno; /* segment number */
+} FsyncTag;
+
+typedef struct
+{
+ FsyncTag tag; /* hash table key (must be first!) */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* canceled is true if we canceled "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ RelFileNode rnode; /* the dead relation to delete */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbFsyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
/*
* This struct of function pointers defines the API between smgr.c and
@@ -58,13 +122,13 @@ typedef struct f_smgr
BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
- void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
+ void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+ char* (*smgr_relpath) (RelFileNodeBackend rnode, ForkNumber forknum);
+ char* (*smgr_segpath) (RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -82,15 +146,13 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
+ .smgr_relpath = mdrelpath,
+ .smgr_segpath = mdsegpath
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -99,7 +161,7 @@ static HTAB *SMgrRelationHash = NULL;
static SMgrRelation first_unowned_reln = NULL;
-/* local function prototypes */
+/* Local function prototypes */
static void smgrshutdown(int code, Datum arg);
static void add_to_unowned_list(SMgrRelation reln);
static void remove_from_unowned_list(SMgrRelation reln);
@@ -126,6 +188,40 @@ smgrinit(void)
/* register the shutdown proc */
on_proc_exit(smgrshutdown, 0);
+
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FsyncTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
}
/*
@@ -725,8 +821,8 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
/*
* smgrimmedsync() -- Force the specified relation to stable storage.
*
- * Synchronously force all previous writes to the specified relation
- * down to disk.
+ * Synchronously force all previous writes to the specified relation,
+ * or specific relation segment, down to disk.
*
* This is useful for building completely new relations (eg, new
* indexes). Instead of incrementally WAL-logging the index build
@@ -746,55 +842,599 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
* otherwise the sync is not very meaningful.
*/
void
-smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
+smgrimmedsync(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
+{
+ smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum, segno);
+}
+
+/*
+ * smgrsync() -- Enqueue fsync request
+ *
+ * Called by internal smgr components to queue a request either locally or
+ * to forward the request to checkpointer.
+ *
+ * Note: this api requires RelFileNode instead of SMgrRelation as callers
+ * include unlink which may not have an open SMgrRelation.
+ *
+ * Returns false if we failed to do either, which means the backend is required
+ * to do their own sync.
+ */
+bool
+smgrsync(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
- smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
+ if (pendingOps)
+ {
+ /* Push it into local pending-ops table */
+ RememberFsyncRequest(rnode, forknum, segno);
+ return true;
+ }
+ else
+ {
+ /* Forward request to checkpointer, which can fail if queue is full */
+ return ForwardFsyncRequest(rnode, forknum, segno);
+ }
}
/*
- * smgrpreckpt() -- Prepare for checkpoint.
+ * Fsync related operations
+ */
+
+
+/*
+ * smgrprecheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+smgrprecheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * smgrpostcheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
*/
void
-smgrpreckpt(void)
+smgrpostcheckpoint(void)
{
- int i;
+ int absorb_counter;
- for (i = 0; i < NSmgr; i++)
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
{
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ RelFileNodeBackend rnode = {.node = entry->rnode, .backend = InvalidBackendId};
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = smgrsw[0].smgr_relpath(rnode, MAIN_FORKNUM);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in ProcessFsyncRequests, we don't want to stop absorbing fsync
+ * requests for a long time when there are many deletions to be done. We
+ * can safely call AbsorbFsyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
}
}
/*
- * smgrsync() -- Sync files to disk during checkpoint.
+ * ProcessFsyncRequests() -- Process queued fsync requests.
*/
void
-smgrsync(void)
+ProcessFsyncRequests(void)
{
- int i;
+ static bool sync_in_progress = false;
- for (i = 0; i < NSmgr; i++)
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirty bit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbFsyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessFsyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
{
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
}
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * file at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ continue;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync-request bits, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments
+ * that have been deleted (unlinked) by the time we get to
+ * them. Rather than just hoping an ENOENT (or EACCES on
+ * Windows) error can be ignored, what we do on error is
+ * absorb pending requests and then retry. Since mdunlink()
+ * queues a "cancel" message before actually unlinking, the
+ * fsync request is guaranteed to be marked canceled after the
+ * absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts
+ * deletions.
+ */
+ for (failures = 0;; failures++) /* loop exits at "break" */
+ {
+ char *path;
+ File fd;
+ RelFileNodeBackend rnode = {.node = entry->tag.rnode, .backend = InvalidBackendId};
+
+ path = smgrsw[0].smgr_segpath(rnode, MAIN_FORKNUM, entry->tag.segno);
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (fd >= 0 &&
+ FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ pfree(path);
+ break; /* out of retry loop */
+ }
+
+ /*
+ * It is possible that the relation has been dropped or
+ * truncated since the fsync request was entered.
+ * Therefore, allow ENOENT, but only if we didn't fail
+ * already on this file. This applies both for
+ * smgrgetseg() and for FileSync, since fd.c might have
+ * closed the file behind our back.
+ *
+ * XXX is there any point in allowing more than one retry?
+ * Don't see one at the moment, but easy to change the
+ * test here if so.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ pfree(path);
+
+ /*
+ * Absorb incoming requests and check to see if a cancel
+ * arrived for this relation fork.
+ */
+ AbsorbFsyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+
+ if (entry->canceled)
+ break;
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->tag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessFsyncRequests */
+ sync_in_progress = false;
}
/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
+ * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * The range of possible segment numbers is way less than the range of
+ * SegmentNumber, so we can reserve high values of segno for special purposes.
+ * We define three:
+ * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
+ * either for one fork, or all forks if forknum is InvalidForkNumber
+ * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
+ * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
+ * checkpoint.
+ * Note also that we're assuming real segment numbers don't exceed INT_MAX.
+ *
+ * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
+ * table has to be searched linearly, but dropping a database is a pretty
+ * heavyweight operation anyhow, so we'll live with it.)
*/
void
-smgrpostckpt(void)
+RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
+ Assert(pendingOps);
+
+ if (segno == FORGET_RELATION_FSYNC)
+ {
+ /* Remove any pending requests for the relation (one or all forks) */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (entry->tag.rnode.dbNode == rnode.dbNode &&
+ entry->tag.rnode.relNode == rnode.relNode)
+ {
+ /* Check if we should remove all forks or a specific fork */
+ if (forknum == InvalidForkNumber ||
+ entry->tag.forknum == forknum)
+ {
+ entry->canceled = true;
+ }
+ }
+ }
+ }
+ else if (segno == FORGET_DATABASE_FSYNC)
+ {
+ /* Remove any pending requests for the entire database */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (entry->tag.rnode.dbNode == rnode.dbNode)
+ {
+ entry->canceled = true;
+ }
+ }
+
+ /* Remove unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (entry->rnode.dbNode == rnode.dbNode)
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (segno == UNLINK_RELATION_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
+ Assert(forknum == MAIN_FORKNUM);
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->rnode = rnode;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+ FsyncTag tag = {.rnode = rnode, .forknum = forknum, .segno = segno};
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ &tag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
+ *
+ * forknum == InvalidForkNumber means all forks, although this code doesn't
+ * actually know that, since it's just forwarding the request elsewhere.
+ */
+void
+ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+{
+ if (pendingOps)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /*
+ * Notify the checkpointer about it. If we fail to queue the cancel
+ * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
+ * error would leave the no-longer-used file still present on disk,
+ * which would be bad, so I'm inclined to assume that the checkpointer
+ * will always empty the queue soon.
+ */
+ while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+
+ /*
+ * Note we don't wait for the checkpointer to actually absorb the
+ * cancel message; see ProcessFsyncRequests() for the implications.
+ */
+ }
+}
+
+/*
+ * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ */
+void
+ForgetDatabaseFsyncRequests(Oid dbid)
+{
+ RelFileNode rnode;
+
+ rnode.dbNode = dbid;
+ rnode.spcNode = 0;
+ rnode.relNode = 0;
+
+ if (pendingOps)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+ }
+ else if (IsUnderPostmaster)
+ {
+ /* see notes in ForgetRelationFsyncRequests */
+ while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
+ FORGET_DATABASE_FSYNC))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
+
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+SetForwardFsyncRequests(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessFsyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
+
+/*
+ * DropRelationFiles -- drop files of all given relations
+ */
+void
+DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo)
+{
+ SMgrRelation *srels;
int i;
- for (i = 0; i < NSmgr; i++)
+ srels = palloc(sizeof(SMgrRelation) * ndelrels);
+ for (i = 0; i < ndelrels; i++)
{
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
+ SMgrRelation srel = smgropen(delrels[i], InvalidBackendId);
+
+ if (isRedo)
+ {
+ ForkNumber fork;
+
+ for (fork = 0; fork <= MAX_FORKNUM; fork++)
+ XLogDropRelation(delrels[i], fork);
+ }
+ srels[i] = srel;
}
+
+ smgrdounlinkall(srels, ndelrels, isRedo);
+
+ /*
+ * Call smgrclose() in reverse order as when smgropen() is called.
+ * This trick enables remove_from_unowned_list() in smgrclose()
+ * to search the SMgrRelation from the unowned list,
+ * with O(1) performance.
+ */
+ for (i = ndelrels - 1; i >= 0; i--)
+ smgrclose(srels[i]);
+ pfree(srels);
}
/*
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3c..5eae12455b 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,7 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
/* GUC options */
@@ -32,7 +33,7 @@ extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
+ SegmentNumber segno);
extern void AbsorbFsyncRequests(void);
extern Size CheckpointerShmemSize(void);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4e..9d38708c20 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,6 +18,39 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+/*
+ * The type used to identify segment numbers. Generally, segments are an
+ * internal detail of individual storage manager implementations, but since
+ * they appear in various places to allow them to be passed between processes,
+ * it seemed worthwhile to have a typename.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
+
+/*
+ * Special values for the segno arg to RememberFsyncRequest.
+ *
+ * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
+ * fsync request from the queue if an identical, subsequent request is found.
+ * See comments there before making changes here.
+ */
+#define FORGET_RELATION_FSYNC (InvalidSegmentNumber)
+#define FORGET_DATABASE_FSYNC (InvalidSegmentNumber-1)
+#define UNLINK_RELATION_REQUEST (InvalidSegmentNumber-2)
+
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessFsyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -105,12 +138,21 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
-extern void AtEOXact_SMgr(void);
+extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+extern bool smgrsync(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void smgrprecheckpoint(void);
+extern void smgrpostcheckpoint(void);
+extern void ProcessFsyncRequests(void);
+extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
+extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern void SetForwardFsyncRequests(void);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+extern void AtEOXact_SMgr(void);
/* internals: move me elsewhere -- ay 7/94 */
@@ -133,16 +175,10 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum,
+ SegmentNumber segno);
+extern char *mdrelpath(RelFileNodeBackend rnode, ForkNumber forknum);
+extern char *mdsegpath(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
#endif /* SMGR_H */
--
2.16.5
mdsync-total-time-instrumentation.patch (text/plain)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..1f968f0bf7 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -1003,6 +1003,11 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
+
+/* TEMPORARY INSTRUMENTATION */
+static int remember_req_cnt = 0;
+
+
/*
* mdsync() -- Sync previous writes to stable storage.
*/
@@ -1024,6 +1029,17 @@ mdsync(void)
uint64 longest = 0;
uint64 total_elapsed = 0;
+
+ /*
+ * TEMPORARY INSTRUMENTATION
+ */
+ instr_time mdsync_start,
+ mdsync_end,
+ mdsync_diff;
+ remember_req_cnt = 0;
+ INSTR_TIME_SET_CURRENT(mdsync_start);
+
+
/*
* This is only called during checkpoints, and checkpoints should only
* occur in processes that have created a pendingOpsTable.
@@ -1295,6 +1311,18 @@ mdsync(void)
/* Flag successful completion of mdsync */
mdsync_in_progress = false;
+
+
+ /*
+ * TEMPORARY INSTRUMENTATION
+ */
+ INSTR_TIME_SET_CURRENT(mdsync_end);
+ mdsync_diff = mdsync_end;
+ INSTR_TIME_SUBTRACT(mdsync_diff, mdsync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(mdsync_diff);
+ if (log_checkpoints)
+ elog(LOG, "debug mdsync stats: remembered=%d fsynced=%d time=%.3f msec",
+ remember_req_cnt, processed, (double) elapsed / 1000);
}
/*
@@ -1482,6 +1510,9 @@ register_unlink(RelFileNodeBackend rnode)
void
RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
{
+ /* TEMPORARY INSTRUMENTATION */
+ remember_req_cnt++;
+
Assert(pendingOpsTable);
if (segno == FORGET_RELATION_FSYNC)
ProcessFsyncRequests-total-time-instrumentation.patch (text/plain)
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index e387c01776..e8e15033f8 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -970,6 +970,11 @@ smgrpostcheckpoint(void)
}
}
+
+/* TEMPORARY INSTRUMENTATION */
+static int remember_req_cnt = 0;
+
+
/*
* ProcessFsyncRequests() -- Process queued fsync requests.
*/
@@ -991,6 +996,18 @@ ProcessFsyncRequests(void)
uint64 longest = 0;
uint64 total_elapsed = 0;
+
+ /*
+ * TEMPORARY INSTRUMENTATION
+ */
+ instr_time mdsync_start,
+ mdsync_end,
+ mdsync_diff;
+ remember_req_cnt = 0;
+ INSTR_TIME_SET_CURRENT(mdsync_start);
+
+
+
/*
* This is only called during checkpoints, and checkpoints should only
* occur in processes that have created a pendingOps.
@@ -1181,6 +1198,19 @@ ProcessFsyncRequests(void)
/* Flag successful completion of ProcessFsyncRequests */
sync_in_progress = false;
+
+
+
+ /*
+ * TEMPORARY INSTRUMENTATION
+ */
+ INSTR_TIME_SET_CURRENT(mdsync_end);
+ mdsync_diff = mdsync_end;
+ INSTR_TIME_SUBTRACT(mdsync_diff, mdsync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(mdsync_diff);
+ if (log_checkpoints)
+ elog(LOG, "debug ProcessFsyncRequests stats: remembered=%d fsynced=%d time=%.3f msec",
+ remember_req_cnt, processed, (double) elapsed / 1000);
}
/*
@@ -1207,6 +1237,9 @@ ProcessFsyncRequests(void)
void
RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
+ /* TEMPORARY INSTRUMENTATION */
+ remember_req_cnt++;
+
Assert(pendingOps);
if (segno == FORGET_RELATION_FSYNC)
Hi,
Thanks for the update!
On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote:
As promised, here's a patch that addresses the points discussed by
Andres and Thomas at FOSDEM. As a result of how we want checkpointer to
track what files to fsync, the pending ops table now integrates the
forknum and segno as part of the hash key eliminating the need for the
bitmapsets, or vectors from the previous iterations. We re-construct the
pathnames from the RelFileNode, ForkNumber and SegmentNumber and use
PathNameOpenFile to get the file descriptor to use for fsync.
I still object to exposing segment numbers to smgr and above. I think
that's an implementation detail of the various storage managers, and we
shouldn't expose smgr.c further than it already is.
Apart from that, this patch moves the system for requesting and
processing fsyncs out of md.c into smgr.c, allowing us to call on smgr
component specific callbacks to retrieve metadata like relation and
segment paths. This allows smgr components to maintain how relfilenodes,
forks and segments map to specific files without exposing this knowledge
to smgr. It redefines smgrsync() behavior to be closer to that of
smgrimmedsync(), i.e., if a regular sync is required for a particular
file, enqueue it locally or forward it to checkpointer.
smgrimmedsync() retains the existing behavior and fsyncs the file right
away. The processing of fsync requests has been moved from mdsync() to a
new ProcessFsyncRequests() function.
I think that's also wrong, imo fsyncs are storage detail, and should be
relegated to *below* md.c. That's not to say the code should live in
md.c, but the issuing of such calls shouldn't be part of smgr.c -
consider e.g. developing a storage engine living in non volatile ram.
Greetings,
Andres Freund
On Thu, Feb 21, 2019 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
Thanks for the update!
+1, thanks Shawn.
It's interesting that you measure performance improvements that IIUC
can come only from dropping the bitmapset stuff (or I guess bugs). I
don't understand the mechanism for that improvement yet.
The idea of just including the segment number (in some representation,
possibly opaque to this code) in the hash table key instead of
carrying a segment list seems pretty good from here (and I withdraw
the sorted vector machinery I mentioned earlier as it's redundant for
this project)... except for one detail. In your patch, you still have
FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this
sync manager code being able to understand which stuff to drop from
the hash table, which means that is still has knowledge of the key
hierarchy, so that it can match all entries for the relation or
database. That's one of the things that Andres is arguing against.
I can see how to fix that for the relation case: md.c could simply
enqueue a FORGET_REQUEST message for every segment and fork in the
relation, so they can be removed by O(1) hash table lookup. I can't
immediately see how to deal with the database case though. Perhaps in
a tag scheme, you'd have to make the tag mostly opaque, except for a
DB OID field, so you can handle this case? (Or some kind of predicate
callback, so you can say "does this tag cancel this other tag?"; seems
over the top).
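To make that concrete, a mostly-opaque tag with just a database OID
exposed might look something like this (a sketch only; the struct and
field names are mine, not from any posted patch):

typedef struct SyncTag
{
	Oid		dboid;		/* exposed so a whole-database cancel can match */
	uint32	opaque[3];	/* layout known only to the owning component */
} SyncTag;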
On 2019-02-20 15:27:40 -0800, Shawn Debnath wrote:
As promised, here's a patch that addresses the points discussed by
Andres and Thomas at FOSDEM. As a result of how we want checkpointer to
track what files to fsync, the pending ops table now integrates the
forknum and segno as part of the hash key eliminating the need for the
bitmapsets, or vectors from the previous iterations. We re-construct the
pathnames from the RelFileNode, ForkNumber and SegmentNumber and use
PathNameOpenFile to get the file descriptor to use for fsync.
I still object to exposing segment numbers to smgr and above. I think
that's an implementation detail of the various storage managers, and we
shouldn't expose smgr.c further than it already is.
Ok by me. The solution to this is probably either raw paths (as
Andres has suggested, but which seem problematic to me -- see below),
or some kind of small fixed size tag type that is morally equivalent
to a path and can be converted to a path but is more convenient for
shoving through pipes and into hash tables.
Apart from that, this patch moves the system for requesting and
processing fsyncs out of md.c into smgr.c, allowing us to call on smgr
component specific callbacks to retrieve metadata like relation and
segment paths. This allows smgr components to maintain how relfilenodes,
forks and segments map to specific files without exposing this knowledge
to smgr. It redefines smgrsync() behavior to be closer to that of
smgrimmedsync(), i.e., if a regular sync is required for a particular
file, enqueue it locally or forward it to checkpointer.
smgrimmedsync() retains the existing behavior and fsyncs the file right
away. The processing of fsync requests has been moved from mdsync() to a
new ProcessFsyncRequests() function.
I think that's also wrong, imo fsyncs are storage detail, and should be
relegated to *below* md.c. That's not to say the code should live in
md.c, but the issuing of such calls shouldn't be part of smgr.c -
consider e.g. developing a storage engine living in non volatile ram.
How about we take all that sync-related stuff, that Shawn has moved
from md.c into smgr.c, and my earlier patch had moved out of md.c into
smgrsync.c, and we put it into a new place
src/backend/storage/file/sync.c? Or somewhere else, but not under
smgr. It doesn't know anything about smgr concepts, and it can be
used to schedule file sync for any kind of file, not just the smgr
implementations' files. Though they'd be the main customers, I guess.
md.c and undofile.c etc would call it to register stuff, and
checkpointer.c would call it to actually perform all the fsync calls.
If we do that, the next question is how to represent filenames. One
idea is to use tags that can be converted back to a path. I suppose
there could be a table of function pointers somewhere, and the tag
could be a discriminated union? Or just a discriminator + a small
array of dumb uint32_t of a size big enough for all users, a bit like
lock tags.
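As a sketch of the latter, assuming we copy the LOCKTAG style (every
name below is invented for illustration):

/* Discriminator saying which component owns a tag and knows how to
 * convert it back to a path. */
typedef enum SyncHandlerId
{
	SYNC_HANDLER_MD,
	SYNC_HANDLER_UNDO,
	SYNC_HANDLER_SLRU
} SyncHandlerId;

/* Small fixed-size tag: a discriminator plus dumb uint32 fields whose
 * layout only the owning component understands. */
typedef struct FileSyncTag
{
	int16	handler;	/* a SyncHandlerId */
	int16	forknum;
	uint32	fields[4];	/* eg spcNode/dbNode/relNode/segno for md.c */
} FileSyncTag;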
One reason to want to use small fixed-sized tags is to avoid atomicity
problems in the future when it comes to the fd-passing work, as
mentioned up-thread. Here are some other ideas, to avoid having to
use tags:
* send the paths through a shm queue, but the fds through the Unix
domain socket; the messages carrying fds somehow point to the pathname
in the shm queue (and deal with slight disorder...)
* send the paths through the socket, but hold an LWLock while doing so
to make sure it's atomic, no matter what the size
* somehow prove that it's really already atomic even for long paths,
on every operating system we support, and that it'll never change, so
there is no problem here
Another problem with variable sized pathnames even without the future
fd-passing work is that it's harder to size the shm queue: the current
code sets max_requests to NBuffers, which makes some kind of sense
because that's a hard upper bound on the number of dirty segments
there could possibly be at a given moment in time (one you'll
practically never hit), after deduplication. It's harder to come up
with a decent size for a new variable-sized-message queue; MAXPGPATH *
NBuffers would be insanely large (it'd be 1/8th the size of the buffer
pool), but if you make it any smaller there is no guarantee that
compacting it can create space. Perhaps the solution to that is
simply to block/wait while shm queue is full -- but that might have
deadlock problems.
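To make the 1/8th figure concrete (assuming the default 8kB BLCKSZ, and
MAXPGPATH being 1024 bytes):

  MAXPGPATH * NBuffers = 1024 * NBuffers bytes
  buffer pool size     = BLCKSZ * NBuffers = 8192 * NBuffers bytes
  ratio                = 1024 / 8192 = 1/8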
--
Thomas Munro
https://enterprisedb.com
It's interesting that you measure performance improvements that IIUC
can come only from dropping the bitmapset stuff (or I guess bugs). I
don't understand the mechanism for that improvement yet.
I will be digging into this a bit more to understand what really is the
cause for the improvements. But first, need to get the refactor patch in
better shape :-)
The idea of just including the segment number (in some representation,
possibly opaque to this code) in the hash table key instead of
carrying a segment list seems pretty good from here (and I withdraw
the sorted vector machinery I mentioned earlier as it's redundant for
this project)... except for one detail. In your patch, you still have
FORGET_RELATION_FSYNC and FORGET_DATABASE_FSYNC. That relies on this
sync manager code being able to understand which stuff to drop from
the hash table, which means that it still has knowledge of the key
hierarchy, so that it can match all entries for the relation or
database. That's one of the things that Andres is arguing against.
You are correct. I actually did mention having a callback to do the
request resolution in an response to Andres back up in the thread, but
oops, completely slipped my mind with my last patch.
How about we take all that sync-related stuff, that Shawn has moved
from md.c into smgr.c, and my earlier patch had moved out of md.c into
smgrsync.c, and we put it into a new place
src/backend/storage/file/sync.c? Or somewhere else, but not under
smgr. It doesn't know anything about smgr concepts, and it can be
used to schedule file sync for any kind of file, not just the smgr
implementations' files. Though they'd be the main customers, I guess.
md.c and undofile.c etc would call it to register stuff, and
checkpointer.c would call it to actually perform all the fsync calls.
If we do that, the next question is how to represent filenames. One
idea is to use tags that can be converted back to a path. I suppose
there could be a table of function pointers somewhere, and the tag
could be a discriminated union? Or just a discriminator + a small
array of dumb uint32_t of a size big enough for all users, a bit like
lock tags.
One reason to want to use small fixed-sized tags is to avoid atomicity
problems in the future when it comes to the fd-passing work, as
mentioned up-thread. Here are some other ideas, to avoid having to
use tags:
* send the paths through a shm queue, but the fds through the Unix
domain socket; the messages carrying fds somehow point to the pathname
in the shm queue (and deal with slight disorder...)
* send the paths through the socket, but hold an LWLock while doing so
to make sure it's atomic, no matter what the size
* somehow prove that it's really already atomic even for long paths,
on every operating system we support, and that it'll never change, so
there is no problem here
Another problem with variable sized pathnames even without the future
fd-passing work is that it's harder to size the shm queue: the current
code sets max_requests to NBuffers, which makes some kind of sense
because that's a hard upper bound on the number of dirty segments
there could possibly be at a given moment in time (one you'll
practically never hit), after deduplication. It's harder to come up
with a decent size for a new variable-sized-message queue; MAXPGPATH *
NBuffers would be insanely large (it'd be 1/8th the size of the buffer
pool), but if you make it any smaller there is no guarantee that
compacting it can create space. Perhaps the solution to that is
simply to block/wait while shm queue is full -- but that might have
deadlock problems.
I think I have a lot better understanding of what Andres is envisioning
and agree with what Thomas has said so far. To summarize, we want a
"sync" component at the level of fd, that components higher up the chain
like md, undo, slru and checkpointer will use to track and process fsync
requests (I am refraining from putting in an ascii diagram here!).
These checkpoint requests will be opaque to the sync machinery and will
rely on requesters to provide the necessary details. I, agree with
Thomas, in that I don't think passing full pathnames or variable
pathnames is the right way to go for all the reasons Thomas mentioned in
his email. However, if we want to, in the future we can easily extend
the checkpoint request to include a type, CHECKPOINT_REQUEST_FD or
CHECKPOINT_REQUEST_PATH, and delegate the current relfilenode to be of
type CHECKPOINT_REQUEST_RELFILENODE. Sync can then act on the requests
based on the type, and in some cases wouldn't need to interact with any
other component.
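To show the shape I have in mind, here is a sketch of that request-type
discriminator (the enum and comments are illustrative only, not final):

typedef enum CheckpointRequestType
{
	CHECKPOINT_REQUEST_RELFILENODE,	/* tag holds relfilenode/fork/segno */
	CHECKPOINT_REQUEST_PATH,		/* tag carries (or points to) a path */
	CHECKPOINT_REQUEST_FD			/* tag accompanies a transferred fd */
} CheckpointRequestType;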
The pieces of information we need to process fsyncs are (1) determine if
a request is to invalidate other requests the queue currently holds and
(2) determine the full path to the file to issue fsync on.
I think using callbacks is the better path forward than having md or
other components issue an invalidate request for each and every segment
which can get quite heavy handed for large databases. Performance would
be the same as today since we already scan the entire hash table when we
encounter a forget request. This new approach will involve one
additional function call inside the loop which does a simple compare.
And Thomas brought up a good point offline: if we followed the path of
smgr for the callbacks, it will lead to header file dependency
nightmare. It would be best for components like md to register its
callback functions with sync so that sync doesn't have to include higher
level header files to get access to their prototypes.
At the time of smgrinit(), mdinit() would call into sync and register
its callbacks with an ID. We can use the same value that we are using
for smgr_which to map the callbacks. Each fsync request will then also
accompany this ID which the sync mechanism will use to call handlers for
resolving forget requests or obtaining paths for files.
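A rough sketch of the registration interface I'm imagining follows; all
of the names here are invented for illustration (md_sync_callbacks would
be md.c's own static callback struct):

struct FileSyncTag;		/* small fixed-size tag, opaque to sync */

typedef char *(*SyncTagToPathFn) (const struct FileSyncTag *tag);
typedef bool (*SyncForgetMatchesFn) (const struct FileSyncTag *forget,
									 const struct FileSyncTag *candidate);

typedef struct SyncHandlerCallbacks
{
	SyncTagToPathFn		tag_to_path;	/* rebuild a pathname to fsync */
	SyncForgetMatchesFn	forget_matches;	/* does a forget cover this entry? */
} SyncHandlerCallbacks;

extern void RegisterSyncHandler(int handler_id,
								const SyncHandlerCallbacks *callbacks);

/* ... and in mdinit(): */
RegisterSyncHandler(SYNC_HANDLER_MD, &md_sync_callbacks);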
I think this should satisfy Andres' requirement for not teaching smgr
anything about segmentation while keeping sync related knowledge far
below the smgr layer. This scheme works for undo and generic block
storage well. Though we might have to teach immedsync to accept block
numbers so that undo and other storage managers can determine what
segment it maps to.
Thoughts? I am going to get started with revising the patch unless I
hear otherwise. And love the great feedback btw, thank you.
--
Shawn Debnath
Amazon Web Services (AWS)
Hi,
While working on another PostgreSQL feature, I was thinking that we could
use a temporal table in PostgreSQL. Some existing databases offer this. I
searched for any discussion on the PostgreSQL mailing list, but could not
find any. Maybe my search wasn’t accurate enough: if anyone can point me
to a discussion, that would be useful.
https://www.percona.com/community-blog/2018/12/14/notes-mariadb-system-versioned-tables/
https://www.mssqltips.com/sqlservertip/3680/introduction-to-sql-server-temporal-tables/
What?
A temporal table feature has two tables: a “Temporal Table” and a “History
Table”. The Temporal Table is where our current tuples are stored. This is
the main table, just like other PostgreSQL tables. The history table is the
other half of the feature and is where all the history of the main table is
stored. This table is created automatically. The history table is used to
query certain data at a certain time, useful for point-in-time analysis.
It also offers built-in versioning.
Why?
Normally users write triggers or procedures to write a history of a table’s
data. Some time-sensitive applications will have code to write a data
history somewhere. By having this functionality, PostgreSQL would do it
automatically. For example, suppose we have a retail table where the price of
each product inventory item is stored. The temporal table would hold the
current price of the product. When we update the price of a product in the
temporal table, then a new row with a timestamp would be added to the
history table. That means on each update of the price, a new row containing
the previous price would be added to the history table. The same would
apply in the case of deletes. When we delete any product from our
inventory, then a row would be added to the history table storing the last
price of the product prior to delete. For any point in time, we can access
the price at which we sold the product.
How?
I was thinking about the implementation of this feature and read the
documentation on the internet. Microsoft SQL Server, for example, offers
such a feature. If we come to the conclusion we should offer the feature, I
will share the complete design.
Here are some ideas I have around this:
- Syntax.
CREATE TABLE tablename
(
…
start_time DATETIME,
end_time DATETIME,
PERIOD FOR SYSTEM_TIME (start_time, end_time)
)
WITH
(
SYSTEM_VERSIONING = ON (HISTORY_TABLE = tablename_history)
);
The tablename is the temporal table and tablename_history is the history
table. The name of the history table is optional, in which case, PostgreSQL
will generate a table name. These two columns are a must for a temporal
table “start_time” and “end_time”. The PERIOD FOR SYSTEM_TIME is used to
identify these columns.
ALTER TABLE SET SYSTEM_VERSIONING = ON/OFF
Due to this syntax addition in CREATE/ALTER TABLE, there are some grammar
additions required in the parser.
PERIOD FOR SYSTEM TIME
SYSTEM VERSIONING
- Catalog Changes.
There are two options: one is to have a new catalog, pg_temporal, which
will contain the information; the other is to store that information in
an existing pg_catalog table.
Table "public.pg_temporal"
Column | Type | Collation | Nullable | Default
-----------------+------+-----------+----------+---------
temporal_id | oid | | |
hist_id | oid | | |
start_date_name | text | | |
end_date_name | text | | |
--
Ibrar Ahmed
On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed
<ibrar.ahmad@gmail.com> wrote:
While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful.
/messages/by-id/CA+renyUb+XHzsrPHHR6ELqguxaUPGhOPyVc7NW+kRsRpBZuUFQ@mail.gmail.com
This is the last one. I don't know why it wasn't in the January CF.
--
Euler Taveira Timbira -
http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
On 2/22/19 11:31 AM, Euler Taveira wrote:
On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed
<ibrar.ahmad@gmail.com> wrote:
While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful.
/messages/by-id/CA+renyUb+XHzsrPHHR6ELqguxaUPGhOPyVc7NW+kRsRpBZuUFQ@mail.gmail.com
This is the last one. I don't know why it wasn't in the January CF.
Oh that's by me! :-)
I didn't put it into the CF because I wanted to get some feedback on
primary keys before I got too far into foreign keys, but someone
recently advised me to starting adding to CFs anyway with "WIP" in the
title, so I'll do that next time.
Btw my own patch is very modest, and I'd love to see this other much
more extensive patch get some attention:
/messages/by-id/CAHO0eLYyvuqwF=2FsgDn1xOs_NOrFBu9Xh-Wq+aWfFy0y6=jWQ@mail.gmail.com
They were told to adjust where in the query pipeline they do their work,
and the latest patch does that (as I understand it), but I don't think
anyone has looked at it yet.
Both of these patches use range types rather than SQL:2011 PERIODs, but
I'd like to *also* support PERIODs (and accept ranges everywhere we
accept PERIODs). Vik Fearing already has a patch to let you *declare*
PERIODs:
https://www.postgresql-archive.org/Periods-td6022563.html
Actually using PERIODs in queries seems like a decent chunk of work
though: basically it means making our grammar & processing accept
PERIODs anywhere they currently accept columns. I'd love to hear some
thoughts/suggestions around that. For example: a PERIOD is *similar* to
a GENERATED column, so maybe the work being done there can/should
influence how we implement them.
I'm excited to be getting some momentum around temporal features though!
I'm supposed to give a talk about them at PGCon in Ottawa this spring,
so hopefully that will help too.
Yours,
--
Paul ~{:-)
pj@illuminatedcomputing.com
On Fri, Feb 22, 2019 at 18:16, Paul Jungwirth
<pj@illuminatedcomputing.com> wrote:
On 2/22/19 11:31 AM, Euler Taveira wrote:
On Fri, Feb 22, 2019 at 15:41, Ibrar Ahmed
<ibrar.ahmad@gmail.com> wrote:
While working on another PostgreSQL feature, I was thinking that we could use a temporal table in PostgreSQL. Some existing databases offer this. I searched for any discussion on the PostgreSQL mailing list, but could not find any. Maybe my search wasn’t accurate enough: if anyone can point me to a discussion, that would be useful.
/messages/by-id/CA+renyUb+XHzsrPHHR6ELqguxaUPGhOPyVc7NW+kRsRpBZuUFQ@mail.gmail.com
This is the last one. I don't know why it wasn't in the January CF.
Oh that's by me! :-)
Forgot to CC you.
I didn't put it into the CF because I wanted to get some feedback on
primary keys before I got too far into foreign keys, but someone
recently advised me to start adding to CFs anyway with "WIP" in the
title, so I'll do that next time.
Getting some feedback is one of the CF goals. Even if you have just a WIP,
those CF feedbacks could help you solve/improve some pieces of your
current code.
Btw my own patch is very modest, and I'd love to see this other much
more extensive patch get some attention:
/messages/by-id/CAHO0eLYyvuqwF=2FsgDn1xOs_NOrFBu9Xh-Wq+aWfFy0y6=jWQ@mail.gmail.com
It isn't in the CF 2019-03. If you want it to be reviewed you should add it.
At this point, both patches should target v13.
--
Euler Taveira Timbira -
http://www.timbira.com.br/
PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento
Hi,
On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote:
I think using callbacks is the better path forward than having md or
other components issue an invalidate request for each and every segment
which can get quite heavy handed for large databases.
I'm not sure I buy this. Unlinking files isn't cheap, involves many disk
writes, etc. The cost of an inval request in comparison isn't
particularly large. Especially for relation-level (rather than database
level) truncation, per-segment invals will likely commonly be faster
than the sequential scan.
At the time of smgrinit(), mdinit() would call into sync and register
it's callbacks with an ID. We can use the same value that we are using
for smgr_which to map the callbacks. Each fsync request will then also
accompany this ID which the sync mechanism will use to call handlers for
resolving forget requests or obtaining paths for files.
I'm not seeing a need to do this dynamically at runtime. Given that smgr
isn't extensible, why don't we just map callbacks (or even just some
switch based logic) based on some enum? Doing things at *init time has
more potential to go wrong, because say a preload_shared_library does
different things in postmaster than normal backends (in EXEC_BACKEND
cases).
Besides those two points, I think this is going in a good direction!
Greetings,
Andres Freund
On Sat, Feb 23, 2019 at 11:15 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote:
I think using callbacks is the better path forward than having md or
other components issue an invalidate request for each and every segment
which can get quite heavy handed for large databases.
I'm not sure I buy this. Unlinking files isn't cheap, involves many disk
writes, etc. The cost of an inval request in comparison isn't
particularly large. Especially for relation-level (rather than database
level) truncation, per-segment invals will likely commonly be faster
than the sequential scan.
Well even if you do it with individual segment cancel messages for
relations, you still need a way to deal with whole-database drops
(generating the cancels for every segment in every relation in the
database would be nuts), and that means either exposing some structure
to the requests, right? So the requests would have { request type,
callback ID, db, opaque tag }, where request type is SYNC, CANCEL,
CANCEL_WHOLE_DB, callback ID is used to find the function that
converts opaque tags to paths, and db is used for handling
CANCEL_WHOLE_DB requests where you need to scan the whole hash table.
Right?
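Spelled out, something shaped like this (all names made up here;
FileSyncTag stands for whatever small fixed-size tag type we settle on):

typedef enum SyncRequestType
{
	SYNC_REQUEST,			/* fsync this file at the next checkpoint */
	SYNC_CANCEL,			/* forget a previously queued request */
	SYNC_CANCEL_WHOLE_DB	/* forget everything for one database */
} SyncRequestType;

typedef struct SyncRequest
{
	SyncRequestType	type;
	int				callback_id;	/* finds the tag-to-path function */
	Oid				db;				/* lets CANCEL_WHOLE_DB match entries */
	FileSyncTag		tag;			/* otherwise opaque to this machinery */
} SyncRequest;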
At the time of smgrinit(), mdinit() would call into sync and register
its callbacks with an ID. We can use the same value that we are using
for smgr_which to map the callbacks. Each fsync request will then also
accompany this ID which the sync mechanism will use to call handlers for
resolving forget requests or obtaining paths for files.
I'm not seeing a need to do this dynamically at runtime. Given that smgr
isn't extensible, why don't we just map callbacks (or even just some
switch based logic) based on some enum? Doing things at *init time has
more potential to go wrong, because say a preload_shared_library does
different things in postmaster than normal backends (in EXEC_BACKEND
cases).
Yeah I suggested dynamic registration to avoid the problem that eg
src/backend/storage/sync.c otherwise needs to forward declare
md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe
#include <storage/md.h> etc, which seemed like exactly the sort of
thing up with which you would not put. (Which reminds me, smgr.{c,h}
suffers from a similar disease, as the comment says: "move me
elsewhere -- ay 7/94").
Besides those two points, I think this is going in a good direction!
Phew. :-)
--
Thomas Munro
https://enterprisedb.com
Hi,
On 2019-02-23 11:42:49 +1300, Thomas Munro wrote:
On Sat, Feb 23, 2019 at 11:15 AM Andres Freund <andres@anarazel.de> wrote:
On 2019-02-22 10:18:57 -0800, Shawn Debnath wrote:
I think using callbacks is the better path forward than having md or
other components issue an invalidate request for each and every segment
which can get quite heavy handed for large databases.
I'm not sure I buy this. Unlinking files isn't cheap, involves many disk
writes, etc. The cost of an inval request in comparison isn't
particularly large. Especially for relation-level (rather than database
level) truncation, per-segment invals will likely commonly be faster
than the sequential scan.
Well even if you do it with individual segment cancel messages for
relations, you still need a way to deal with whole-database drops
(generating the cancels for every segment in every relation in the
database would be nuts), and that means either exposing some structure
to the requests, right? So the requests would have { request type,
callback ID, db, opaque tag }, where request type is SYNC, CANCEL,
CANCEL_WHOLE_DB, callback ID is used to find the function that
converts opaque tags to paths, and db is used for handling
CANCEL_WHOLE_DB requests where you need to scan the whole hash table.
Right?
I'm ok with using callbacks to allow pruning for things like dropping
databases. If we use callbacks, I don't see a need to explicitly include
the db in the request however? The callback can look into the opaque
tag, no? Also, why do we need a separation between request type and
callback? That seems like it'll commonly be entirely redundant?
At the time of smgrinit(), mdinit() would call into sync and register
its callbacks with an ID. We can use the same value that we are using
for smgr_which to map the callbacks. Each fsync request will then also
accompany this ID which the sync mechanism will use to call handlers for
resolving forget requests or obtaining paths for files.
I'm not seeing a need to do this dynamically at runtime. Given that smgr
isn't extensible, why don't we just map callbacks (or even just some
switch based logic) based on some enum? Doing things at *init time has
more potential to go wrong, because say a preload_shared_library does
different things in postmaster than normal backends (in EXEC_BACKEND
cases).
Yeah I suggested dynamic registration to avoid the problem that eg
src/backend/storage/sync.c otherwise needs to forward declare
md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe
#include <storage/md.h> etc, which seemed like exactly the sort of
thing up with which you would not put.
I'm not sure I understand. If we have a few known tag types, what's the
problem with including the headers with knowledge of how to implement
them into sync.c file?
Greetings,
Andres Freund
On Sat, Feb 23, 2019 at 11:48 AM Andres Freund <andres@anarazel.de> wrote:
Yeah I suggested dynamic registration to avoid the problem that eg
src/backend/storage/sync.c otherwise needs to forward declare
md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe
#include <storage/md.h> etc, which seemed like exactly the sort of
thing up with which you would not put.
I'm not sure I understand. If we have a few known tag types, what's the
problem with including the headers with knowledge of how to implement
them into sync.c file?
Typo in my previous email: src/backend/storage/file/sync.c was my
proposal for the translation unit holding this stuff (we don't have .c
files directly under storage). But it didn't seem right that stuff
under storage/file (things concerned with files) should know about
stuff under storage/smgr (md.c functions, higher level smgr stuff).
Perhaps that just means it should go into a different subdir, maybe
src/backend/storage/sync/sync.c, that knows about files AND smgr
stuff, or perhaps I'm worrying about nothing.
--
Thomas Munro
https://enterprisedb.com
Hi,
On 2019-02-23 11:59:04 +1300, Thomas Munro wrote:
On Sat, Feb 23, 2019 at 11:48 AM Andres Freund <andres@anarazel.de> wrote:
Yeah I suggested dynamic registration to avoid the problem that eg
src/backend/storage/sync.c otherwise needs to forward declare
md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe
#include <storage/md.h> etc, which seemed like exactly the sort of
thing up with which you would not put.
I'm not sure I understand. If we have a few known tag types, what's the
problem with including the headers with knowledge of how to implement
them into sync.c file?
Typo in my previous email: src/backend/storage/file/sync.c was my
proposal for the translation unit holding this stuff (we don't have .c
files directly under storage). But it didn't seem right that stuff
under storage/file (things concerned with files) should know about
stuff under storage/smgr (md.c functions, higher level smgr stuff).
Perhaps that just means it should go into a different subdir, maybe
src/backend/storage/sync/sync.c, that knows about files AND smgr
stuff, or perhaps I'm worrying about nothing.
I mean, if you have a md_tagtopath and need its behaviour, I don't
understand why a local forward declaration changes things? But I also
think this is a bit of a non-problem - the abbreviated path names are
just a different representation of paths, I don't see a problem of
having that in sync.[ch], especially if we have the option to also have
a full-length pathname variant if we ever need that.
Greetings,
Andres Freund
Well even if you do it with individual segment cancel messages for
relations, you still need a way to deal with whole-database drops
(generating the cancels for every segment in every relation in the
database would be nuts), and that means either exposing some structure
to the requests, right? So the requests would have { request type,
callback ID, db, opaque tag }, where request type is SYNC, CANCEL,
CANCEL_WHOLE_DB, callback ID is used to find the function that
converts opaque tags to paths, and db is used for handling
CANCEL_WHOLE_DB requests where you need to scan the whole hash table.
Right?
I'm ok with using callbacks to allow pruning for things like dropping
databases. If we use callbacks, I don't see a need to explicitly include
the db in the request however? The callback can look into the opaque
tag, no? Also, why do we need a separation between request type and
callback? That seems like it'll commonly be entirely redundant?
By having the id to distinguish which smgr to use for callbacks, we
don't need DB. You are correct, my plan was to make the whole request
opaque to the sync mechanism. ForwardFsyncRequest will take requester
smgr id, request type (forget [db/rel], sync), relfilenode, forknum, and
segno and convert it into an opaque CheckpointRequest and queue it
locally. The only responsibility here is to map the different pieces of
data into the opaque CheckpointerRequest. Requester ID in combination
with request type will help us look up which callback function to
execute.
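As a sketch of that packing step (the signature, struct layout and the
helper at the end are placeholders, not a final interface):

bool
ForwardFsyncRequest(int requester_id, SyncRequestType type,
					RelFileNode rnode, ForkNumber forknum,
					SegmentNumber segno)
{
	CheckpointerRequest req;

	req.type = type;				/* sync, or forget [db/rel] */
	req.requester = requester_id;	/* used later to look up callbacks */
	req.tag.rnode = rnode;			/* opaque payload from here on */
	req.tag.forknum = forknum;
	req.tag.segno = segno;

	/* queue locally, or hand off to checkpointer (hypothetical helper) */
	return QueueOrTransferRequest(&req);
}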
A new enum for requester ID is perfectly okay. I was trying to recycle
the smgr id but perhaps that's not the right approach.
Yeah I suggested dynamic registration to avoid the problem that
eg
src/backend/storage/sync.c otherwise needs to forward declare
md_tagtopath(), undofile_tagtopath(), slru_tagtopath(), ..., or maybe
#include <storage/md.h> etc, which seemed like exactly the sort of
thing up with which you would not put.
I'm not sure I understand. If we have a few known tag types, what's the
problem with including the headers with knowledge of how to implement
them into sync.c file?
Typo in my previous email: src/backend/storage/file/sync.c was my
proposal for the translation unit holding this stuff (we don't have .c
files directly under storage). But it didn't seem right that stuff
under storage/file (things concerned with files) should know about
stuff under storage/smgr (md.c functions, higher level smgr stuff).
Perhaps that just means it should go into a different subdir, maybe
src/backend/storage/sync/sync.c, that knows about files AND smgr
stuff, or perhaps I'm worrying about nothing.
I mean, if you have a md_tagtopath and need its behaviour, I don't
understand why a local forward declaration changes things? But I also
think this is a bit of a non-problem - the abbreviated path names are
just a different representation of paths, I don't see a problem of
having that in sync.[ch], especially if we have the option to also have
a full-length pathname variant if we ever need that.
Today, all the callbacks for smgr have their prototypes defined in
smgr.h and used in smgr.c. Forward declarations within the new sync.h
would be fine but having md callbacks split in two different places may
not be the cleanest approach? One could argue they serve different
purposes so perhaps it's correct that we define them separately. I am
fine with either, but now I probably prefer the new enum to fixed
function declaration mappings that Andres suggested. I agree it would be
less error prone.
--
Shawn Debnath
Amazon Web Services (AWS)
On 2019-02-22 15:45:50 -0800, Shawn Debnath wrote:
> By having the id to distinguish which smgr to use for callbacks, we
> don't need DB. [...]
>
> A new enum for requester ID is perfectly okay. I was trying to recycle
> the smgr id but perhaps that's not the right approach.
I think it'd be a bad idea to use the smgr id here. So +1 for
separating.
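
For example, something as simple as this would do (illustrative; the
v10 patch later in this thread names the md.c entry SYNC_MD):

/* Identifies who owns a sync request and thus which callbacks apply. */
typedef enum SyncRequestOwner
{
    SYNC_MD = 0         /* md.c, the only owner to start with */
    /* future: SYNC_UNDO, SYNC_SLRU, ... */
} SyncRequestOwner;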
> Today, all the callbacks for smgr have their prototypes defined in
> smgr.h and used in smgr.c. [...] I am fine with either, but now I
> probably prefer the new enum to fixed function declaration mappings
> that Andres suggested. I agree it would be less error prone.
I'd really just entirely separate the two. I'd not include any knowledge
of this mechanism into smgr.h, and just make the expansion of paths
statically dispatched (potentially into md.c) to actually do
the work. I'm not sure why sync.h would need any forward declarations
for that?
Greetings,
Andres Freund
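
Concretely, the static dispatch Andres describes is what the v10 patch
posted later in this thread ends up doing: a constant table of
per-owner callbacks, indexed by the requester id (distilled from that
patch):

/* Per-owner callbacks for resolving sync requests. */
typedef struct f_sync
{
    char       *(*sync_filepath) (FileTag ftag);
    bool        (*sync_tagmatches) (FileTag ftag, FileTag predicate,
                                    SyncRequestType type);
} f_sync;

static const f_sync syncsw[] = {
    /* magnetic disk */
    {
        .sync_filepath = mdfilepath,
        .sync_tagmatches = mdtagmatches
    }
    /* future owners (undo, SLRU, ...) would add entries here */
};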
Hi Paul,
On Sat, Feb 23, 2019 at 2:16 AM Paul Jungwirth <pj@illuminatedcomputing.com>
wrote:
On 2/22/19 11:31 AM, Euler Taveira wrote:
On Fri, 22 Feb 2019 at 15:41, Ibrar Ahmed
<ibrar.ahmad@gmail.com> wrote:

While working on another PostgreSQL feature, I was thinking that we
could use a temporal table in PostgreSQL. Some existing databases offer
this. I searched for any discussion on the PostgreSQL mailing list, but
could not find any. Maybe my search wasn’t accurate enough: if anyone can
point me to a discussion, that would be useful.

/messages/by-id/CA+renyUb+XHzsrPHHR6ELqguxaUPGhOPyVc7NW+kRsRpBZuUFQ@mail.gmail.com
This is the last one. I don't know why it wasn't in the January CF.
Oh that's by me! :-)
I didn't put it into the CF because I wanted to get some feedback on
primary keys before I got too far into foreign keys, but someone
recently advised me to start adding to CFs anyway with "WIP" in the
title, so I'll do that next time.

Btw my own patch is very modest, and I'd love to see this other much
more extensive patch get some attention:

/messages/by-id/CAHO0eLYyvuqwF=2FsgDn1xOs_NOrFBu9Xh-Wq+aWfFy0y6=jWQ@mail.gmail.com
They were told to adjust where in the query pipeline they do their work,
and the latest patch does that (as I understand it), but I don't think
anyone has looked at it yet.

Both of these patches use range types rather than SQL:2011 PERIODs, but
I'd like to *also* support PERIODs (and accept ranges everywhere we
accept PERIODs). Vik Fearing already has a patch to let you *declare*
PERIODs:

https://www.postgresql-archive.org/Periods-td6022563.html
Actually using PERIODs in queries seems like a decent chunk of work
though: basically it means making our grammar & processing accept
PERIODs anywhere they currently accept columns. I'd love to hear some
thoughts/suggestions around that. For example: a PERIOD is *similar* to
a GENERATED column, so maybe the work being done there can/should
influence how we implement them.

I'm excited to be getting some momentum around temporal features though!
I'm supposed to give a talk about them at PGCon in Ottawa this spring,
so hopefully that will help too.

Yours,
--
Paul ~{:-)
pj@illuminatedcomputing.com

Great to hear that you are working on that. Do you think I can help you
with this? I did some groundwork to make it possible. I can help with
coding/reviewing, or can even take the lead if you want.
--
Ibrar Ahmed
On Fri, Feb 22, 2019 at 03:57:42PM -0800, Andres Freund wrote:
> > [...] I am fine with either, but now I probably prefer the new enum
> > to fixed function declaration mappings that Andres suggested. I agree
> > it would be less error prone.
>
> I'd really just entirely separate the two. I'd not include any knowledge
> of this mechanism into smgr.h, and just make the expansion of paths
> statically dispatched (potentially into md.c) to actually do the work.
> I'm not sure why sync.h would need any forward declarations for that?
We had a quick offline discussion to get on the same page and we agreed
to move forward with Andres' approach above. Attached is patch v10.
Here's an overview of the patch:
1. Move the system for requesting and processing fsyncs out of md.c
into storage/sync/sync.c with definitions in include/storage/sync.h.
ProcessSyncRequests() is now responsible for processing the sync
requests during checkpoint.
2. Remove the need for specific storage managers to implement pre- and
post-checkpoint callbacks. These are now executed by the sync mechanism.
3. We now embed the fork number and the segment number as part of the
hash key for the pending ops table. This eliminates the bitmapset-based
segment tracking for each relfilenode during fsync, as not all storage
managers may map their segments from zero.
4. Each sync request now must include a type: sync, forget, forget
hierarchy, or unlink, and the owner who will be responsible for
generating paths or matching forget requests (see the sketch after
this list).
5. For cancelling relation sync requests, we now must send a forget
request for each fork and segment in the relation.
6. We do not rely on smgr to provide the file descriptor we use to
issue fsync. Instead, we generate the full path based on the FileTag
in the sync request and use PathNameOpenFile to get the file descriptor.
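
To make the new interface concrete, here is the shape of the key and
the forwarding call, distilled from the attached patch (see sync.h and
md.c there for the real definitions; field order is illustrative):

/* The hash key for the pending-ops table: relation, fork, and segment. */
typedef struct FileTag
{
    RelFileNode     rnode;
    ForkNumber      forknum;
    SegmentNumber   segno;
} FileTag;

bool ForwardSyncRequest(FileTag ftag, SyncRequestType type,
                        SyncRequestOwner owner);

/* For example, md.c's register_dirty_segment() now does essentially: */
FileTag     tag;

tag.rnode = reln->smgr_rnode.node;
tag.forknum = forknum;
tag.segno = seg->mdfd_segno;
ForwardSyncRequest(tag, SYNC_REQUEST, SYNC_MD);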
Ran make check-world and repeated the tests described in [1]. The
numbers show a 12% drop in total time for a single run of 1000 clients
and a ~62% drop in total time for 10 parallel runs with 200 clients:
[Requests Absorbed]
                     Min       Max     Average    Median    Std Dev
--------------- -------- --------- ----------- --------- ----------
patch-1x1000       17554    147410    85252.95   83835.5   21898.88
master-1x1000      25728    138422    81455.04     80601   21295.83
patch-10x200      125922    279682   197098.76    197055   34038.25
master-10x200     191833    602512   416533.86    424946   82014.48

[Files Synced]
                     Min       Max     Average    Median    Std Dev
--------------- -------- --------- ----------- --------- ----------
patch-1x1000         155       213      158.93       159       2.97
master-1x1000        154       166      158.29       159      10.29
patch-10x200        1456      1577     1546.84      1547      11.23
master-10x200       1546      1546        1546      1559      12.79

[Total Time in ProcessFsyncRequest/mdsync]
                     Min       Max     Average    Median    Std Dev
--------------- -------- --------- ----------- --------- ----------
patch-1x1000         606      4022     2145.11      2123     440.72
master-1x1000        806   4430.32     2458.77      2382     497.01
patch-10x200        2472      6960      4156.8      4141     817.56
master-10x200       4323     17858    10982.15     11154    2760.47
[1]: /messages/by-id/20190220232739.GA8280@f01898859afd.ant.amazon.com
--
Shawn Debnath
Amazon Web Services (AWS)
Attachments:
0001-Refactor-the-fsync-machinery-to-support-future-SMGR-v10.patch
From 735ae5eafda5479c073f5eb542be59bb84ba41a5 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH] Refactor the fsync mechanism to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting and processing fsyncs out of md.c
into storage/sync/sync.c with definitions in include/storage/sync.h.
ProcessSyncRequests() is now responsible for processing the sync
requests during checkpoint.
2. Removed the need for specific storage managers to implement pre and
post checkpoint callbacks. These are now executed by the sync mechanism.
3. We now embed the fork number and the segment number as part of the
hash key for the pending ops table. This eliminates the bitmapset based
segment tracking for each relfilenode during fsync as not all storage
managers may map their segments from zero.
4. Each sync request now must include a type: sync, forget, forget
hierarchy, or unlink, and the owner who will be responsible for
generating paths or matching forget requests.
5. For cancelling relation sync requests, we now must send a forget
request for each fork and segment in the relation.
6. We do not rely on smgr to provide the file descriptor we use to
issue fsync. Instead, we generate the full path based on the FileTag
in the sync request and use PathNameOpenFile to get the file descriptor.
Author: Shawn Debnath, Thomas Munro
Reviewed-by:
Discussion:
https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 6 +-
src/backend/postmaster/checkpointer.c | 52 +--
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 791 ++++------------------------------
src/backend/storage/smgr/smgr.c | 54 ---
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 590 +++++++++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/segment.h | 28 ++
src/include/storage/smgr.h | 17 +-
src/include/storage/sync.h | 108 +++++
15 files changed, 890 insertions(+), 806 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/segment.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..a04f993e3e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6986,7 +6987,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ PreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8912,7 +8913,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ PostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..825e44de26 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -940,11 +940,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2149,7 +2149,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..6fb22246a6 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -106,14 +106,6 @@
* the requests fields are protected by CheckpointerCommLock.
*----------
*/
-typedef struct
-{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
-} CheckpointerRequest;
-
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -131,7 +123,7 @@ typedef struct
int num_requests; /* current # of requests */
int max_requests; /* allocated array size */
- CheckpointerRequest requests[FLEXIBLE_ARRAY_MEMBER];
+ SyncRequest requests[FLEXIBLE_ARRAY_MEMBER];
} CheckpointerShmemStruct;
static CheckpointerShmemStruct *CheckpointerShmem;
@@ -347,7 +339,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -676,7 +668,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -701,7 +693,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -885,7 +877,7 @@ CheckpointerShmemSize(void)
* NBuffers. This may prove too large or small ...
*/
size = offsetof(CheckpointerShmemStruct, requests);
- size = add_size(size, mul_size(NBuffers, sizeof(CheckpointerRequest)));
+ size = add_size(size, mul_size(NBuffers, sizeof(SyncRequest)));
return size;
}
@@ -1063,7 +1055,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1092,9 +1084,9 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner)
{
- CheckpointerRequest *request;
+ SyncRequest *request;
bool too_full;
if (!IsUnderPostmaster)
@@ -1130,9 +1122,9 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->owner = owner;
+ request->type = type;
+ request->ftag = ftag;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1168,7 +1160,7 @@ CompactCheckpointerRequestQueue(void)
{
struct CheckpointerSlotMapping
{
- CheckpointerRequest request;
+ SyncRequest request;
int slot;
};
@@ -1187,7 +1179,7 @@ CompactCheckpointerRequestQueue(void)
/* Initialize temporary hash table */
MemSet(&ctl, 0, sizeof(ctl));
- ctl.keysize = sizeof(CheckpointerRequest);
+ ctl.keysize = sizeof(SyncRequest);
ctl.entrysize = sizeof(struct CheckpointerSlotMapping);
ctl.hcxt = CurrentMemoryContext;
@@ -1211,7 +1203,7 @@ CompactCheckpointerRequestQueue(void)
*/
for (n = 0; n < CheckpointerShmem->num_requests; n++)
{
- CheckpointerRequest *request;
+ SyncRequest *request;
struct CheckpointerSlotMapping *slotmap;
bool found;
@@ -1263,8 +1255,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1272,10 +1264,10 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
- CheckpointerRequest *requests = NULL;
- CheckpointerRequest *request;
+ SyncRequest *requests = NULL;
+ SyncRequest *request;
int n;
if (!AmCheckpointerProcess())
@@ -1303,8 +1295,8 @@ AbsorbFsyncRequests(void)
n = CheckpointerShmem->num_requests;
if (n > 0)
{
- requests = (CheckpointerRequest *) palloc(n * sizeof(CheckpointerRequest));
- memcpy(requests, CheckpointerShmem->requests, n * sizeof(CheckpointerRequest));
+ requests = (SyncRequest *) palloc(n * sizeof(SyncRequest));
+ memcpy(requests, CheckpointerShmem->requests, n * sizeof(SyncRequest));
}
START_CRIT_SECTION();
@@ -1314,7 +1306,7 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(request->ftag, request->type, request->owner);
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6e..8376cdfca2 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..887023fc8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..c5c17dc69d 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,17 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
#include "storage/relfilenode.h"
+#include "storage/segment.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,53 +86,27 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink(RelFileNodeBackend rnode);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+
/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
+ * Segment handling behaviors
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
-
-
-/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL (1 << 0)
/* return NULL if segment not present */
@@ -179,26 +125,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +134,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +248,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -408,6 +266,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
path = relpath(rnode, forkNum);
+ /* Forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ ForgetRelationSyncRequests(rnode.node, forkNum, InvalidSegmentNumber);
+
/*
* Delete or truncate the first segment.
*/
@@ -459,6 +321,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /* Forget any pending sync requests for the rest of the segments */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ ForgetRelationSyncRequests(rnode.node, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1004,385 +870,50 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
+ * mdfilepath()
+ *
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
*/
-void
-mdsync(void)
+char *
+mdfilepath(FileTag ftag)
{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
+ char *path,
+ *fullpath;
/*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
+ * We can safely pass InvalidBackendId as we never expect to sync
+ * any segments for temporary relations.
*/
- AbsorbFsyncRequests();
+ path = GetRelationPath(ftag.rnode.dbNode, ftag.rnode.spcNode,
+ ftag.rnode.relNode, InvalidBackendId, ftag.forknum);
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
+ if (ftag.segno > 0 && ftag.segno != InvalidSegmentNumber)
{
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
+ fullpath = psprintf("%s.%u", path, ftag.segno);
+ pfree(path);
}
+ else
+ fullpath = path;
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
+ return fullpath;
}
/*
- * mdpostckpt() -- Do post-checkpoint work
+ * mdtagmatches()
*
- * Remove any lingering files that can now be safely removed.
+ * Returns true if the predicate tag matches with the file tag.
*/
-void
-mdpostckpt(void)
+bool
+mdtagmatches(FileTag ftag, FileTag predicate, SyncRequestType type)
{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
+ /* Today, we only do matching for hierarchy (forget database) requests */
+ Assert(type == FORGET_HIERARCHY_REQUEST);
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
+ if (type == FORGET_HIERARCHY_REQUEST)
+ return ftag.rnode.dbNode == predicate.rnode.dbNode;
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return false;
}
/*
@@ -1397,17 +928,22 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTag tag;
+ tag.rnode = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (IsSyncManagedLocally())
{
/* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+ RememberSyncRequest(tag, SYNC_REQUEST, SYNC_MD);
}
else
{
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
+ if (ForwardSyncRequest(tag, SYNC_REQUEST, SYNC_MD))
return; /* passed it off successfully */
ereport(DEBUG1,
@@ -1433,14 +969,18 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
static void
register_unlink(RelFileNodeBackend rnode)
{
+ FileTag tag;
+ tag.rnode = rnode.node;
+ tag.forknum = MAIN_FORKNUM;
+ tag.segno = InvalidSegmentNumber;
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
+ if (IsSyncManagedLocally())
{
/* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
+ RememberSyncRequest(tag, UNLINK_REQUEST, SYNC_MD);
}
else
{
@@ -1452,165 +992,11 @@ register_unlink(RelFileNodeBackend rnode)
* XXX should we just leave the file orphaned instead?
*/
Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
+ while (!ForwardSyncRequest(tag, UNLINK_REQUEST, SYNC_MD))
pg_usleep(10000L); /* 10 msec seems a good number */
}
}
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
/*
* ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
*
@@ -1618,12 +1004,19 @@ RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
* actually know that, since it's just forwarding the request elsewhere.
*/
void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- if (pendingOpsTable)
+ FileTag tag;
+
+ tag.rnode = rnode;
+ tag.forknum = forknum;
+ tag.segno = segno;
+
+ if (IsSyncManagedLocally())
{
/* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+ RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD);
}
else if (IsUnderPostmaster)
{
@@ -1637,7 +1030,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
* which would be bad, so I'm inclined to assume that the checkpointer
* will always empty the queue soon.
*/
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
+ while (!ForwardSyncRequest(tag, FORGET_REQUEST, SYNC_MD))
pg_usleep(10000L); /* 10 msec seems a good number */
/*
@@ -1651,24 +1044,26 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
* ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
- RelFileNode rnode;
+ FileTag tag;
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
+ tag.rnode.dbNode = dbid;
+ tag.rnode.spcNode = 0;
+ tag.rnode.relNode = 0;
+ tag.forknum = InvalidForkNumber;
+ tag.segno = InvalidSegmentNumber;
- if (pendingOpsTable)
+ if (IsSyncManagedLocally())
{
/* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+ RememberSyncRequest(tag, FORGET_HIERARCHY_REQUEST, SYNC_MD);
+
}
else if (IsUnderPostmaster)
{
/* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
+ while (!ForwardSyncRequest(tag, FORGET_HIERARCHY_REQUEST, SYNC_MD))
pg_usleep(10000L); /* 10 msec seems a good number */
}
}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab3..e10ad826aa 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -59,12 +59,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -82,15 +78,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -751,52 +743,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 0000000000..cfc60cadb4
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 0000000000..584b6c60d5
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,590 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ FileTag ftag; /* hash table key (must be first!) */
+ SyncRequestOwner owner; /* owner for request resolution */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* canceled is true if we canceled "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ FileTag ftag; /* tag for relation file to delete */
+ SyncRequestOwner owner; /* owner for request resolution */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * This struct of function pointers defines the API between sync.c and
+ * any module that owns sync requests. The callbacks resolve an opaque
+ * FileTag to a pathname that can be opened and fsync'd, and decide
+ * whether a pending entry matches the predicate of a forget request.
+ * See the md.c implementations for details.
+ */
+typedef struct f_sync
+{
+ char* (*sync_filepath) (FileTag ftag);
+ bool (*sync_tagmatches) (FileTag ftag,
+ FileTag predicate, SyncRequestType type);
+} f_sync;
+
+static const f_sync syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_filepath = mdfilepath,
+ .sync_tagmatches = mdtagmatches
+ }
+};
+
+static const int NSync = lengthof(syncsw);
+
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+
+}
+
+/*
+ * PreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+PreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * PostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+PostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = syncsw[entry->owner].sync_filepath(entry->ftag);
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+ * requests for a long time when there are many deletions to be done. We
+ * can safely call AbsorbSyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirty bit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessSyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * file at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ continue;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync-request bits, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments
+ * that have been deleted (unlinked) by the time we get to
+ * them. Rather than just hoping an ENOENT (or EACCES on
+ * Windows) error can be ignored, what we do on error is
+ * absorb pending requests and then retry. Since mdunlink()
+ * queues a "cancel" message before actually unlinking, the
+ * fsync request is guaranteed to be marked canceled after the
+ * absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts
+ * deletions.
+ */
+ for (failures = 0;; failures++) /* loop exits at "break" */
+ {
+ char *path;
+ File fd;
+
+ path = syncsw[entry->owner].sync_filepath(entry->ftag);
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (fd >= 0 &&
+ FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ pfree(path);
+ break; /* out of retry loop */
+ }
+
+ /*
+ * It is possible that the relation has been dropped or
+ * truncated since the fsync request was entered.
+ * Therefore, allow ENOENT, but only if we didn't fail
+ * already on this file. This applies both for
+ * PathNameOpenFile() and for FileSync(), since fd.c might
+ * have closed the file behind our back.
+ *
+ * XXX is there any point in allowing more than one retry?
+ * Don't see one at the moment, but easy to change the
+ * test here if so.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ pfree(path);
+
+ /*
+ * Absorb incoming requests and check to see if a cancel
+ * arrived for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+
+ if (entry->canceled)
+ break;
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->ftag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner)
+{
+ Assert(pendingOps);
+
+ if (type == FORGET_REQUEST)
+ {
+ /* Remove previously entered request */
+ if (hash_search(pendingOps,
+ (void *) &ftag,
+ HASH_REMOVE, NULL) == NULL)
+ elog(DEBUG5, "pendingOps table couldn't find entry for forget request");
+ }
+ else if (type == FORGET_HIERARCHY_REQUEST)
+ {
+ /* Remove any pending requests for the entire database */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (syncsw[entry->owner].sync_tagmatches(entry->ftag,
+ ftag /* predicate */, type))
+ {
+ entry->canceled = true;
+ }
+ }
+
+ /* Remove unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (syncsw[entry->owner].sync_tagmatches(entry->ftag,
+ ftag /* predicate */, type))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->ftag = ftag;
+ entry->owner = owner;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ &ftag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->owner = owner;
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * Are sync requests managed locally by the backend?
+ */
+bool
+IsSyncManagedLocally(void)
+{
+ return pendingOps != NULL;
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..0326e6c6ed 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -554,6 +555,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3c..2d3672f84a 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(FileTag ftag, SyncRequestType type,
+ SyncRequestOwner owner);
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb..40f46b871d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessSyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/segment.h b/src/include/storage/segment.h
new file mode 100644
index 0000000000..c7af945168
--- /dev/null
+++ b/src/include/storage/segment.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * segment.h
+ * POSTGRES disk segment definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/segment.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SEGMENT_H
+#define SEGMENT_H
+
+
+/*
+ * Segment Number:
+ *
+ * Each relation and its forks are divided into segments. This
+ * definition formalizes the definition of the segment number.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
+
+#endif /* SEGMENT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4e..528a831140 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,7 +17,7 @@
#include "fmgr.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
-
+#include "storage/segment.h"
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -106,12 +106,8 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
/* internals: move me elsewhere -- ay 7/94 */
/* in md.c */
@@ -134,15 +130,10 @@ extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
+extern void ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno);
+extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 0000000000..b5a918a487
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,108 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/segment.h"
+
+/*
+ * Caller specified type of sync request.
+ *
+ * SYNC_REQUESTs are issued to sync a particular file whose path is determined
+ * by calling back the owner. A FORGET_REQUEST instructs the sync mechanism to
+ * cancel a previously submitted sync request.
+ *
+ * FORGET_HIERARCHY_REQUEST is a special type of forget request that involves
+ * scanning all pending sync requests and cancelling any entry that matches. The
+ * entries are resolved by calling back the owner as the key is opaque to the
+ * sync mechanism. Handling these requests is a tad slow because we
+ * have to search all the pending requests linearly, but the operations
+ * that use them, such as dropping a database, are pretty heavyweight
+ * anyhow, so we can live with it.
+ *
+ * UNLINK_REQUEST is a request to delete the file after the next checkpoint.
+ * The path is determined by calling back the owner.
+ */
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ FORGET_REQUEST,
+ FORGET_HIERARCHY_REQUEST,
+ UNLINK_REQUEST
+} syncrequesttype;
+
+/*
+ * Identifies the owner for the sync callbacks.
+ *
+ * These enums map back to entries in the callback function table. For
+ * consistency, explicitly set the value to 0. See sync.c for more information.
+ */
+typedef enum syncrequestowner
+{
+ SYNC_MD = 0 /* md smgr */
+} syncrequestowner;
+
+/*
+ * Store the request type and owner identifier in a uint8 to reduce the
+ * overall memory footprint of the SyncRequest structure, which is used in
+ * the checkpointer queue.
+ */
+typedef uint8 SyncRequestType;
+typedef uint8 SyncRequestOwner;
+
+/*
+ * Augmenting a relfilenode with the fork and segment number provides all
+ * the information to locate the particular segment of interest for a relation.
+ */
+typedef struct
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTag;
+
+typedef struct
+{
+ SyncRequestType type; /* type of sync request */
+ SyncRequestOwner owner; /* owner for request resolution */
+
+ /*
+ * Currently, sync requests can be satisfied by information available in
+ * the FileTag. In the future, this could be combined with a physical
+ * file descriptor or the full path to a file and put inside a union.
+ *
+ * This value is opaque to the sync mechanism and is passed to the
+ * callback handlers to retrieve the path of the file to sync or to
+ * resolve forget requests.
+ */
+ FileTag ftag;
+} SyncRequest;
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void PreCheckpoint(void);
+extern void PostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(FileTag ftag, SyncRequestType type,
+ SyncRequestOwner owner);
+extern bool IsSyncManagedLocally(void);
+extern void EnableSyncRequestForwarding(void);
+
+/* md callback forward declarations */
+extern char* mdfilepath(FileTag ftag);
+extern bool mdtagmatches(FileTag ftag, FileTag predicate, SyncRequestType type);
+
+#endif /* SYNC_H */
--
2.16.5
On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath <sdn@amazon.com> wrote:
We had a quick offline discussion to get on the same page and we agreed
to move forward with Andres' approach above. Attached is patch v10.
Here's the overview of the patch:
Thanks. I will review, and try to rebase my undo patches on top of
this and see what problems I crash into.
Ran make check-world and repeated the tests described in [1]. The
numbers show a 12% drop in total time for a single run of 1000 clients and
~62% drop in total time for 10 parallel runs with 200 clients:
Hmm, good but unexpected. Will poke at this here.
--
Thomas Munro
https://enterprisedb.com
On Thu, Feb 28, 2019 at 10:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Thu, Feb 28, 2019 at 10:27 AM Shawn Debnath <sdn@amazon.com> wrote:
We had a quick offline discussion to get on the same page and we agreed
to move forward with Andres' approach above. Attached is patch v10.
Here's the overview of the patch:
Thanks. I will review, and try to rebase my undo patches on top of
this and see what problems I crash into.
Some initial feedback:
@@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ PreCheckpoint();
I would call this and the "post" variant something like
SyncPreCheckpoint(). Otherwise it's too general sounding and not
clear which module it's in.
+static const int NSync = lengthof(syncsw);
Unused.
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
Needs to be closed.
+ path = syncsw[entry->owner].sync_filepath(entry->ftag);
Probably doesn't make much difference, but wouldn't it be better for
the path to be written into a caller-supplied buffer of size
MAXPGPATH? Then we could have that on the stack instead of alloc/free
for every path.
Hmm, mdfilepath() needs to use GetRelationPath(), and that already
returns palloc'd memory. Oh well.
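For the archives, here is roughly what I had in mind (a sketch only,
with a hypothetical callback signature; as noted, GetRelationPath()
gets in the way for md.c):

    typedef void (*sync_filepath_fn) (const FileTag *ftag,
                                      char *path);    /* MAXPGPATH bytes */

    /* caller side, e.g. in ProcessSyncRequests() */
    char        path[MAXPGPATH];

    syncsw[entry->owner].sync_filepath(&entry->ftag, path);
    fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);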
+ entry->canceled = true;
Now that we killed the bitmapset, I wonder if we still need this
canceled flag. What if we just removed the entry from the hash table?
If you killed the canceled flag you could then replace this:
+ if (entry->canceled)
+ break;
.. with another hash table probe to see if the entry went in the
AbsorbSyncRequests() call (having first copied the key into a local
variable since of course "entry" may have been freed). Or maybe you
don't think that's better, I dunno, just an idea :-)
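To be concrete, the probe-again alternative might look like this (a
sketch; it assumes the forget request removes the entry outright
instead of marking it canceled):

    FileTag     key = entry->ftag;  /* "entry" may be freed by the absorb */

    AbsorbSyncRequests();
    absorb_counter = FSYNCS_PER_ABSORB;
    if (hash_search(pendingOps, &key, HASH_FIND, NULL) == NULL)
        break;          /* a forget request removed it; stop retrying */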
+ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner)
Is it a deliberate choice that you pass FileTag objects around by
value? Rather than, say, pointer to const. Not really a complaint in
the current coding since it's a small object anyway (not much bigger
than a pointer), but I guess someone might eventually want to make it
into a flexible sized object, or something, I dunno.
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- if (pendingOpsTable)
+ FileTag tag;
+
+ tag.rnode = rnode;
+ tag.forknum = forknum;
+ tag.segno = segno;
+
+ if (IsSyncManagedLocally())
{
/* standalone backend or startup process: fsync state
is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+ RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD);
}
...
You left this and similar functions in md.c, but I think it needs to
move out to sync.c, and just take a FileTag directly. Otherwise I
have to write similar functions in undofile.c, and it seems kinda
weird that those modules are worrying about whether sync is managed
locally or the message needs to be sent to the checkpointer, and even
worse, they have to duplicate the loop that deals with
ForwardSyncRequest() failing and retrying. Concretely I'm saying that
sync.c should define a function like this:
+/*
+ * PostSyncRequest
+ *
+ * Remember locally, or post to checkpointer as appropriate.
+ */
+void
+PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner)
+{
+ if (IsSyncManagedLocally())
+ {
+ /* standalone backend or startup process: fsync state
is local */
+ RememberSyncRequest(tag, type, owner);
+ }
+ else if (IsUnderPostmaster)
+ {
+ while (!ForwardSyncRequest(tag, type, owner))
+ pg_usleep(10000L); /* 10 msec seems a
good number */
+ }
+}
Hmm, perhaps it would need to take an argument to say whether it
should keep retrying or return false if it fails; that way
register_dirty_segment() could perform the FileSync() itself if the
queue is full, but register_unlink could tell it to keep trying. Does
this make sense?
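In other words, something like this (a sketch only; the retry flag is
the new part, and the names are as in the proposal above):

    bool
    PostSyncRequest(FileTag tag, SyncRequestType type,
                    SyncRequestOwner owner, bool retry)
    {
        if (IsSyncManagedLocally())
        {
            /* standalone backend or startup process: state is local */
            RememberSyncRequest(tag, type, owner);
            return true;
        }

        for (;;)
        {
            if (ForwardSyncRequest(tag, type, owner))
                return true;
            if (!retry)
                return false;   /* caller may fall back to FileSync() */
            pg_usleep(10000L);  /* 10 msec seems a good number */
        }
    }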
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ FORGET_REQUEST,
+ FORGET_HIERARCHY_REQUEST,
+ UNLINK_REQUEST
+} syncrequesttype;
These names are too generic for the global C namespace; how about
SYNC_REQ_CANCEL or similar?
+typedef enum syncrequestowner
+{
+ SYNC_MD = 0 /* md smgr */
+} syncrequestowner;
I have a feeling that Andres wanted to see a single enum combining
both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
have it.
+/* md callback forward declarations */
+extern char* mdfilepath(FileTag ftag);
+extern bool mdtagmatches(FileTag ftag, FileTag predicate,
SyncRequestType type);
It's weird that these ^ are declared in sync.h. I think they should
be declared in a new header md.h and that should be included by
sync.c. I know that we have this historical weird thing there md.c's
functions are declared in smgr.h, but we should eventually fix that.
More soon.
--
Thomas Munro
https://enterprisedb.com
Hi,
On 2019-03-01 23:17:27 +1300, Thomas Munro wrote:
@@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
 * the REDO pointer. Note that smgr must not do anything that'd have to
 * be undone if we decide no checkpoint is needed.
 */
- smgrpreckpt();
+ PreCheckpoint();
I would call this and the "post" variant something like
SyncPreCheckpoint(). Otherwise it's too general sounding and not
clear which module it's in.
Definitely.
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ FORGET_REQUEST,
+ FORGET_HIERARCHY_REQUEST,
+ UNLINK_REQUEST
+} syncrequesttype;
These names are too generic for the global C namespace; how about
SYNC_REQ_CANCEL or similar?
+typedef enum syncrequestowner
+{
+ SYNC_MD = 0 /* md smgr */
+} syncrequestowner;
I have a feeling that Andres wanted to see a single enum combining
both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
have it.
Obviously it's nicer looking this way, but OTOH, that means we have to
send more data over the queue, because we can't easily combine the
request + "owner". I don't have too strong feelings about it though.
FWIW, I don't like the name owner here. Class? Method?
Greetings,
Andres Freund
On Fri, Mar 1, 2019 at 12:43 PM Andres Freund <andres@anarazel.de> wrote:
Obviously it's nicer looking this way, but OTOH, that means we have to
send more data over the queue, because we can't easily combine the
request + "owner". I don't have too strong feelings about it though.
Yeah, I would lean toward combining those.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote:
+typedef enum syncrequestowner
+{
+ SYNC_MD = 0 /* md smgr */
+} syncrequestowner;
I have a feeling that Andres wanted to see a single enum combining
both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
have it.
Obviously it's nicer looking this way, but OTOH, that means we have to
send more data over the queue, because we can't easily combine the
request + "owner". I don't have too strong feelings about it though.
Yeah, I would lean toward combining those.
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values. Not to
mention that you have multiple enum values for the same behavior - which
just isn't clean. And handling of these enums in the code would be ugly
too.
Do note that these are typedef'ed to uint8 currently. For a default
config with 128 MB shared_buffers, we will use an extra 16kB (one extra
byte to represent the owner). I am hesitant to change this right now
unless folks feel strongly about it.
If so, I would combine the type and owner by splitting it up in 4 bit
chunks, allowing for 16 request types and 16 smgrs. This change would
only apply for the in-memory queue. The code and functions would
continue using the enums.
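To illustrate the packing (a sketch; the exact macro names in the
attached patch may differ): owner in the high nibble, request type in
the low nibble of a single uint8.

    #define SYNC_COMBO(type, owner)   ((uint8) (((owner) << 4) | ((type) & 0x0F)))
    #define SYNC_COMBO_TYPE(v)        ((v) & 0x0F)
    #define SYNC_COMBO_OWNER(v)       (((v) >> 4) & 0x0F)

    /* one byte per queue entry instead of two enum-sized fields */
    uint8       combo = SYNC_COMBO(SYNC_REQUEST, SYNC_MD);

    Assert(SYNC_COMBO_TYPE(combo) == SYNC_REQUEST);
    Assert(SYNC_COMBO_OWNER(combo) == SYNC_MD);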
--
Shawn Debnath
Amazon Web Services (AWS)
On Sat, Mar 2, 2019 at 8:36 AM Shawn Debnath <sdn@amazon.com> wrote:
On Fri, Mar 01, 2019 at 01:15:21PM -0500, Robert Haas wrote:
+typedef enum syncrequestowner
+{
+ SYNC_MD = 0 /* md smgr */
+} syncrequestowner;
I have a feeling that Andres wanted to see a single enum combining
both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
have it.
Obviously it's nicer looking this way, but OTOH, that means we have to
send more data over the queue, because we can't easily combine the
request + "owner". I don't have too strong feelings about it though.
Yeah, I would lean toward combining those.
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values. Not to
mention that you have multiple enum values for the same behavior - which
just isn't clean. And handling of these enums in the code would be ugly
too.Do note that these are typedef'ed to uint8 currently. For a default
config with 128 MB shared_buffers, we will use an extra 16kB (one extra
byte to represent the owner). I am hesitant to change this right now
unless folks feel strongly about it.If so, I would combine the type and owner by splitting it up in 4 bit
chunks, allowing for 16 request types and 16 smgrs. This change would
only apply for the in-memory queue. The code and functions would
continue using the enums.
+1
--
Thomas Munro
https://enterprisedb.com
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote:
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values.
How big of an explosion would it be?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote:
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values.How big of an explosion would it be?
4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
total. Any future smgr addition will expand this further.
--
Shawn Debnath
Amazon Web Services (AWS)
On Sat, Mar 2, 2019 at 9:35 AM Shawn Debnath <sdn@amazon.com> wrote:
On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote:
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values.
How big of an explosion would it be?
4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
total. Any future smgr addition will expand this further.
It's not so much the "explosion" that bothers me. I think we should
have a distinct sync requester enum, because we need a way to index
into the table of callbacks. How exactly you pack the two enums into
compact space seems like a separate question; doing it with two words
would obviously be wasteful, but it should be possible stuff them into
(say) a single uint8_t, uint16_t or whatever will pack nicely in the
request struct and allow the full range of request types (4?) + the
full range of sync requesters (which we propose to expand to 3 in the
forseeable future). Now perhaps the single enum idea was going to
involve explicit values that encode the two values SYNC_REQ_CANCEL_MD
= 0x1 | (0x04 << 4) so you could still extract the requester part, but
that's just the same thing with uglier code.
--
Thomas Munro
https://enterprisedb.com
On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath <sdn@amazon.com> wrote:
On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote:
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values.
How big of an explosion would it be?
4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
total. Any future smgr addition will expand this further.
I thought the idea was that each smgr might have a different set of
requests. If they're all going to have the same set of requests then
I agree with you.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 01, 2019 at 09:42:56AM -0800, Andres Freund wrote:
FWIW, I don't like the name owner here. Class? Method?
How about handler? Similar to when looking up callback functions in
lookup_index_am_handler_func()?
--
Shawn Debnath
Amazon Web Services (AWS)
On 2/25/19 4:21 AM, Ibrar Ahmed wrote:
Great, to hear that you are working on that. Do you think I can help you
with this? I did some groundwork to make it possible. I can help in
coding/reviewing or even can take lead if you want to.
Hi Ibrar,
I'd love some help with this! I submitted my patch to the March
commitfest, and Peter Moser & Anton Dignös submitted theirs also. I
still need to rebase on the most recent commits, but I'll try to do that
tonight or tomorrow. Personally I'd love some review and feedback,
because this is my first substantial patch. (I made a small change to
btree_gist a couple years ago also....)
I think the challenge with temporal functionality is that there are a
lot of new concepts, and we'd like them all to hang together in a
coherent way. (That's why I want to give a talk about it: to increase
background understanding in the Postgres community.) So having someone
take the lead on it makes sense. I'm happy to provide some opinions and
direction, but my own coding contributions are likely to be slow, and
having a regular contributor more closely involved would help a lot.
Here are some thoughts about things that need work:
- temporal primary keys (my patch)
- temporal foreign keys (I've done some work on this adding to my patch
but I haven't finished it yet.)
- temporal joins (Moser/Dignös patch)
- declaring PERIODs (Vik Fearing's patch)
- showing PERIODs in the system catalog (Vik Fearing's patch)
- using PERIODs in SELECT, WHERE, GROUP BY, HAVING, function arguments,
etc. (TODO)
- SYSTEM_TIME PERIODs for transaction-time tables (TODO)
- temporal insert/update/delete for transaction-time tables (TODO)
- temporal insert/update/delete for valid-time tables (TODO)
- temporal SELECT for valid-time tables (TODO, could build off the
Moser/Dignös work)
- temporal SELECT for transaction-time tables (TODO)
I think the transaction-time stuff is easier, but also less interesting,
and there are well-known patterns for accomplishing it already. I'm more
interested in supporting valid-time tables personally.
--
Paul ~{:-)
pj@illuminatedcomputing.com
On Fri, Mar 01, 2019 at 04:27:47PM -0500, Robert Haas wrote:
On Fri, Mar 1, 2019 at 3:35 PM Shawn Debnath <sdn@amazon.com> wrote:
On Fri, Mar 01, 2019 at 03:03:19PM -0500, Robert Haas wrote:
On Fri, Mar 1, 2019 at 2:36 PM Shawn Debnath <sdn@amazon.com> wrote:
I disagree, at least with combining and retaining enums. Encoding all
the possible request types with the current, planned and future SMGRs
would cause a sheer explosion in the number of enum values.How big of an explosion would it be?
4 enum values x # of smgrs; currently md, soon undo and slru so 12 in
total. Any future smgr addition will expand this further.
I thought the idea was that each smgr might have a different set of
requests. If they're all going to have the same set of requests then
I agree with you.
Yeah, in this particular case and at this layer, the operations are
consistent across all storage managers, in that they want to queue a
new sync request for a specific file, forget an already queued request,
forget a hierarchy of requests, or unlink a specific file.
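Put differently, a new SMGR only adds a row to the callback table in
sync.c, not new request types. A sketch, with a hypothetical undo
entry:

    static const f_sync syncsw[] = {
        /* magnetic disk */
        {
            .sync_filepath = mdfilepath,
            .sync_tagmatches = mdtagmatches
        },
        /* undo logs -- hypothetical future entry */
        {
            .sync_filepath = undofilepath,
            .sync_tagmatches = undofiletagmatches
        }
    };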
The fun is at the smgr layer which was discussed in a sub-thread in the
"Drop type smgr" thread started by Thomas. I started on a patch and will
be sending it out after the refactor patch is revised.
--
Shawn Debnath
Amazon Web Services (AWS)
On Fri, Mar 01, 2019 at 11:17:27PM +1300, Thomas Munro wrote:
- smgrpreckpt();
+ PreCheckpoint();
I would call this and the "post" variant something like
SyncPreCheckpoint(). Otherwise it's too general sounding and not
clear which module it's in.
Sure - fixed.
+static const int NSync = lengthof(syncsw);
Unused.
Good catch - interesting how there was no warning thrown for this.
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
Needs to be closed.
*ahem* fixed
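For the archives, the corrected sequence looks roughly like this (a
sketch; the helper name is mine). The vfd is closed on both the
success and failure paths, preserving errno for the error report:

    static bool
    sync_one_file(const char *path)
    {
        File        fd;
        int         result;
        int         save_errno;

        fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
        if (fd < 0)
            return false;

        result = FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC);
        save_errno = errno;
        FileClose(fd);          /* close even on failure: don't leak vfds */
        errno = save_errno;

        return result >= 0;
    }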
+ path = syncsw[entry->owner].sync_filepath(entry->ftag);
Probably doesn't make much difference, but wouldn't it be better for
the path to be written into a caller-supplied buffer of size
MAXPGPATH? Then we could have that on the stack instead of alloc/free
for every path.Hmm, mdfilepath() needs to use GetRelationPath(), and that already
returns palloc'd memory. Oh well.
Yep - I tried to muck with this as well but the logic is too neatly
encapsulated inside relpath.c and given the usages via
relpath[perm|backend] - I chose to forgo the attempt.
+ entry->canceled = true;
Now that we killed the bitmapset, I wonder if we still need this
canceled flag. What if we just removed the entry from the hash table?
If you killed the canceled flag you could then replace this:
+ if (entry->canceled)
+ break;
.. with another hash table probe to see if the entry went in the
AbsorbSyncRequests() call (having first copied the key into a local
variable since of course "entry" may have been freed). Or maybe you
don't think that's better, I dunno, just an idea :-)
It seems safer and cleaner to have the canceled flag as other approaches
don't really give us any gain. But this made me realize that I should
also cancel the entry for the simple forget case instead of removing it
from the hash table. Fixed and modified the for loop to break if entry
was cancelled. Deletion of the entry now happens in one place - after it
has been processed or skipped inside ProcessSyncRequests.
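So the forget path now looks roughly like this (sketch):

    /* Mark it canceled rather than removing it here */
    entry = (PendingFsyncEntry *) hash_search(pendingOps, &ftag,
                                              HASH_FIND, NULL);
    if (entry != NULL)
        entry->canceled = true; /* removed later in ProcessSyncRequests */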
+ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestOwner owner)
Is it a deliberate choice that you pass FileTag objects around by
value? Rather than, say, pointer to const. Not really a complaint in
the current coding since it's a small object anyway (not much bigger
than a pointer), but I guess someone might eventually want to make it
into a flexible sized object, or something, I dunno.
It was deliberate to match what we are doing today. You have a good
point regarding future modifications. May as well get the API changed
correctly the first time around. Changed.
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+ForgetRelationSyncRequests(RelFileNode rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- if (pendingOpsTable)
+ FileTag tag;
+
+ tag.rnode = rnode;
+ tag.forknum = forknum;
+ tag.segno = segno;
+
+ if (IsSyncManagedLocally())
{
/* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+ RememberSyncRequest(tag, FORGET_REQUEST, SYNC_MD);
}
...
You left this and similar functions in md.c, but I think it needs to
move out to sync.c, and just take a FileTag directly. Otherwise I
have to write similar functions in undofile.c, and it seems kinda
weird that those modules are worrying about whether sync is managed
locally or the message needs to be sent to the checkpointer, and even
worse, they have to duplicate the loop that deals with
ForwardSyncRequest() failing and retrying. Concretely I'm saying that
sync.c should define a function like this:
+/*
+ * PostSyncRequest
+ *
+ * Remember locally, or post to checkpointer as appropriate.
+ */
+void
+PostSyncRequest(FileTag tag, SyncRequestType type, SyncRequestOwner owner)
+{
+ if (IsSyncManagedLocally())
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(tag, type, owner);
+ }
+ else if (IsUnderPostmaster)
+ {
+ while (!ForwardSyncRequest(tag, type, owner))
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+}
Hmm, perhaps it would need to take an argument to say whether it
should keep retrying or return false if it fails; that way
register_dirty_segment() could perform the FileSync() itself if the
queue is full, but register_unlink could tell it to keep trying. Does
this make sense?
Yeah - makes sense. I did have it on my list, but I think I wanted to
minimize changes to md and wanted to wait for a use case. Though it
seems clear it's needed and looks like you already ran into it with
undo. Added a new function RegisterSyncRequest that checks for local
table or forwards to checkpointer. Accepts a parameter retryOnError
which does what it says or exits on failure.
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ FORGET_REQUEST,
+ FORGET_HIERARCHY_REQUEST,
+ UNLINK_REQUEST
+} syncrequesttype;
These names are too generic for the global C namespace; how about
SYNC_REQ_CANCEL or similar?
Yeah, again, was trying to match what we had before. I prefixed these
enums with SYNC to have a more readable set:
SYNC_REQUEST
SYNC_FORGET_REQUEST
SYNC_FORGET_HIERARCHY_REQUEST
SYNC_UNLINK_REQUEST
Except for the hierarchy one, they are pretty reasonable in length.
+typedef enum syncrequestowner
+{
+ SYNC_MD = 0 /* md smgr */
+} syncrequestowner;
I have a feeling that Andres wanted to see a single enum combining
both the "operation" and the "owner", like SYNC_REQ_CANCEL_MD,
SYNC_REQ_CANCEL_UNDO, ... but I actually like it better the way you
have it.
This was tackled in a sub-thread. I am keeping the enums but changing
the checkpointer queue to persist the type and owner combo in 8
bits.
+/* md callback forward declarations */
+extern char* mdfilepath(FileTag ftag);
+extern bool mdtagmatches(FileTag ftag, FileTag predicate, SyncRequestType type);
It's weird that these ^ are declared in sync.h. I think they should
be declared in a new header md.h and that should be included by
sync.c. I know that we have this historical weird thing there md.c's
functions are declared in smgr.h, but we should eventually fix that.
Long story short - I wanted to avoid having md.h include sync.h for the
FileTag definitions. But - it's cleaner this way. Changed.
--
Shawn Debnath
Amazon Web Services (AWS)
Attachments:
0001-Refactor-the-fsync-machinery-to-support-future-SMGR-v11.patchtext/plain; charset=us-asciiDownload
From 0eb4e5eb03e173e47007ece5379b721497dda400 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH] Refactor the fsync mechanism to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting and processing fsyncs out of md.c
into storage/sync/sync.c with definitions in include/storage/sync.h.
ProcessSyncRequests() is now responsible for processing the sync
requests during checkpoint.
2. Removed the need for specific storage managers to implement pre and
post checkpoint callbacks. These are now executed by the sync mechanism.
3. We now embed the fork number and the segment number as part of the
hash key for the pending ops table. This eliminates the bitmapset based
segment tracking for each relfilenode during fsync as not all storage
managers may map their segments from zero.
4. Each sync request now must include a type: sync, forget, forget
hierarchy, or unlink, and the owner who will be responsible for
generating paths or matching forget requests.
5. For cancelling relation sync requests, we now must send a forget
request for each fork and segment in the relation.
6. We do not rely on smgr to provide the file descriptor we use to
issue fsync. Instead, we generate the full path based on the FileTag
in the sync request and use PathNameOpenFile to get the file descriptor.
Author: Shawn Debnath, Thomas Munro
Reviewed-by:
Discussion:
https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/twophase.c | 1 +
src/backend/access/transam/xact.c | 1 +
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 7 +-
src/backend/postmaster/checkpointer.c | 63 ++-
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 860 ++++------------------------------
src/backend/storage/smgr/smgr.c | 55 +--
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 633 +++++++++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/md.h | 52 ++
src/include/storage/segment.h | 28 ++
src/include/storage/smgr.h | 38 --
src/include/storage/sync.h | 81 ++++
18 files changed, 985 insertions(+), 884 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/md.h
create mode 100644 src/include/storage/segment.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 64679dd2de..80150467c7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -98,6 +98,7 @@
#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..5384f62b34 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -50,6 +50,7 @@
#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..b2b154e77a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6986,7 +6987,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8912,7 +8913,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ SyncPostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..d553e2087c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -53,6 +53,7 @@
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
@@ -940,11 +941,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2149,7 +2150,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..be51342e45 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -108,12 +108,38 @@
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ /*
+ * To reduce memory footprint, we combine the SyncRequestType and the
+ * SyncRequestHandler by splitting them into 4 bits each and storing them
+ * in a uint8. The type and handler values each account for far fewer
+ * than 15 entries, so this works just fine.
+ */
+ uint8 sync_type_handler_combo;
+
+ /*
+ * Currently, sync requests can be satisfied by information available in
+ * the FileTag. In the future, this could be combined with a physical
+ * file descriptor or the full path to a file and put inside a union.
+ *
+ * This value is opaque to the sync mechanism and is passed to the
+ * callback handlers to retrieve the path of the file to sync or to
+ * resolve forget requests.
+ */
+ FileTagData ftag;
} CheckpointerRequest;
+/*
+ * Handler occupies the higher 4 bits while type occupies the lower 4 in
+ * the uint8 combo storage.
+ */
+static uint8 sync_request_type_mask = 0x0F;
+static uint8 sync_request_handler_mask = 0xF0;
+
+#define SYNC_TYPE_AND_HANDLER_COMBO(t, h) ((h) << 4 | (t))
+#define SYNC_REQUEST_TYPE_VALUE(v) (sync_request_type_mask & (v))
+#define SYNC_REQUEST_HANDLER_VALUE(v) ((sync_request_handler_mask & (v)) >> 4)
+
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -347,7 +373,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -676,7 +702,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -701,7 +727,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -1063,7 +1089,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1092,10 +1118,10 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestHandler handler)
{
CheckpointerRequest *request;
- bool too_full;
+ bool too_full;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1130,9 +1156,8 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->sync_type_handler_combo = SYNC_TYPE_AND_HANDLER_COMBO(type, handler);
+ request->ftag = *ftag;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1169,7 +1194,7 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int slot;
};
int n,
@@ -1263,8 +1288,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to the sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1272,7 +1297,7 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
@@ -1314,7 +1339,9 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(&(request->ftag),
+ SYNC_REQUEST_TYPE_VALUE(request->sync_type_handler_combo),
+ SYNC_REQUEST_HANDLER_VALUE(request->sync_type_handler_combo));
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6e..8376cdfca2 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..887023fc8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..2ddcd52a5a 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,18 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/relfilenode.h"
+#include "storage/segment.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,53 +87,30 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+
/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
+ * Segment handling behaviors
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
-
-
-/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL (1 << 0)
/* return NULL if segment not present */
@@ -179,26 +129,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +138,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +252,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -413,6 +275,11 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
+ /* First, forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, 0 /* first seg */);
+
+ /* Next unlink the file */
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
@@ -442,7 +309,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ register_unlink_segment(rnode, forkNum, 0 /* first seg */);
}
/*
@@ -459,6 +326,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /* Forget any pending sync requests for the segment before we unlink */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1004,385 +875,50 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
+ * mdfilepath()
+ *
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
*/
-void
-mdsync(void)
+char *
+mdfilepath(FileTag ftag)
{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
+ char *path,
+ *fullpath;
/*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
+ * We can safely pass InvalidBackendId as we never expect to sync
+ * any segments for temporary relations.
*/
- AbsorbFsyncRequests();
+ path = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, ftag->forknum);
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
+ if (ftag->segno > 0 && ftag->segno != InvalidSegmentNumber)
{
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
+ fullpath = psprintf("%s.%u", path, ftag->segno);
+ pfree(path);
}
+ else
+ fullpath = path;
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
+ return fullpath;
}
/*
- * mdpostckpt() -- Do post-checkpoint work
+ * mdfiletagmatches()
*
- * Remove any lingering files that can now be safely removed.
+ * Returns true if the predicate tag matches the file tag.
*/
-void
-mdpostckpt(void)
+bool
+mdfiletagmatches(FileTag ftag, FileTag predicate, SyncRequestType type)
{
- int absorb_counter;
+ /* Today, we only do matching for hierarchy (forget database) requests */
+ Assert(type == SYNC_FORGET_HIERARCHY_REQUEST);
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
+ if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ return ftag->rnode.dbNode == predicate->rnode.dbNode;
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return false;
}
/*
@@ -1397,19 +933,17 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTagData tag;
+ tag.rnode = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD,
+ false /*retryOnError*/))
{
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1423,254 +957,54 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/*
* register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
*/
static void
-register_unlink(RelFileNodeBackend rnode)
+register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
+ FileTagData tag;
+ tag.rnode = rnode.node;
+ tag.forknum = forknum;
+ tag.segno = segno;
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
-}
-
-/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
- */
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
-{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
+ RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
+ * register_forget_request() -- forget any fsyncs for a relation fork's segment
*/
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
+static void
+register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
+ FileTagData tag;
+ tag.rnode = rnode.node;
+ tag.forknum = forknum;
+ tag.segno = segno;
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
- RelFileNode rnode;
-
- rnode.dbNode = dbid;
- rnode.spcNode = 0;
- rnode.relNode = 0;
-
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ FileTagData tag;
+ tag.rnode.dbNode = dbid;
+ tag.rnode.spcNode = 0;
+ tag.rnode.relNode = 0;
+ tag.forknum = InvalidForkNumber;
+ tag.segno = InvalidSegmentNumber;
+
+ RegisterSyncRequest(&tag, SYNC_FORGET_HIERARCHY_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab3..190cf1c83f 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "storage/bufmgr.h"
#include "storage/ipc.h"
#include "storage/smgr.h"
+#include "storage/md.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -59,12 +60,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -82,15 +79,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -751,52 +744,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 0000000000..cfc60cadb4
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 0000000000..4e665407af
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,633 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ FileTagData ftag; /* hash table key (must be first!) */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* canceled is true if we canceled "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ FileTagData ftag; /* tag for relation file to delete */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * This struct of function pointers defines the API between sync.c and
+ * any module whose files can be synchronized at checkpoint time. Each
+ * handler must be able to resolve a FileTag to a pathname (so that the
+ * file can be reopened or unlinked), and to test whether a file tag
+ * matches a given "forget" predicate tag.
+ */
+typedef struct f_sync
+{
+ char *(*sync_filepath) (FileTag ftag);
+ bool (*sync_filetagmatches) (FileTag ftag,
+ FileTag predicate, SyncRequestType type);
+} f_sync;
+
+static const f_sync syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_filepath = mdfilepath,
+ .sync_filetagmatches = mdfiletagmatches
+ }
+};
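New kinds of block storage, such as the proposed SLRU and UNDO managers,
would plug in by adding a row here indexed by their SyncRequestHandler
value. A hypothetical sketch (none of the SLRU symbols below exist yet):

    static const f_sync syncsw[] = {
        /* magnetic disk */
        {
            .sync_filepath = mdfilepath,
            .sync_filetagmatches = mdfiletagmatches
        },
        /* hypothetical future entry, indexed by SYNC_HANDLER_SLRU */
        {
            .sync_filepath = slrufilepath,
            .sync_filetagmatches = slrufiletagmatches
        }
    };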
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FileTagData);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * SyncPreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+SyncPreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * SyncPostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+SyncPostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+ * requests for a long time when there are many deletions to be done.
+ * We can safely call AbsorbSyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
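Taken together, the checkpoint-time call sequence becomes, in simplified
form (a sketch; the real control flow runs through CreateCheckPoint and
CheckPointGuts):

    SyncPreCheckpoint();    /* bump cycle counter, before REDO point chosen */
    BufferSync(flags);      /* write dirty buffers; backends queue fsyncs */
    ProcessSyncRequests();  /* absorb the queue and fsync everything */
    /* ... checkpoint record is written ... */
    SyncPostCheckpoint();   /* unlink files queued before this cycle */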
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessSyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * file at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ continue;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync-request bits, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments
+ * that have been deleted (unlinked) by the time we get to
+ * them. Rather than just hoping an ENOENT (or EACCES on
+ * Windows) error can be ignored, what we do on error is
+ * absorb pending requests and then retry. Since mdunlink()
+ * queues a "cancel" message before actually unlinking, the
+ * fsync request is guaranteed to be marked canceled after the
+ * absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts
+ * deletions.
+ *
+ * If the entry was canceled after the absorb above, or within the
+ * absorb inside the loop, we exit the loop and delete the entry
+ * right afterwards. The loop can also exit via "break" on success.
+ */
+ for (failures = 0; !(entry->canceled); failures++)
+ {
+ char *path;
+ File fd;
+
+ path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (fd >= 0 &&
+ FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ FileClose(fd);
+ pfree(path);
+ break; /* out of retry loop */
+ }
+
+ /* Done with the file descriptor, close it */
+ if (fd >= 0)
+ FileClose(fd);
+
+ /*
+ * It is possible that the relation has been dropped or
+ * truncated since the fsync request was entered.
+ * Therefore, allow ENOENT, but only if we didn't fail
+ * already on this file. This applies both for
+ * PathNameOpenFile() and for FileSync, since fd.c might have
+ * closed the file behind our back.
+ *
+ * XXX is there any point in allowing more than one retry?
+ * Don't see one at the moment, but easy to change the
+ * test here if so.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ pfree(path);
+
+ /*
+ * Absorb incoming requests and check to see if a cancel
+ * arrived for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->ftag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(FileTag ftag, SyncRequestType type, SyncRequestHandler handler)
+{
+ Assert(pendingOps);
+
+ if (type == SYNC_FORGET_REQUEST)
+ {
+ PendingFsyncEntry *entry;
+ /* Cancel previously entered request */
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *)ftag,
+ HASH_FIND,
+ NULL);
+ if (entry != NULL)
+ entry->canceled = true;
+ }
+ else if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ {
+ /* Remove any pending requests for the entire database */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (syncsw[entry->handler].sync_filetagmatches(&(entry->ftag),
+ ftag /* predicate */, type))
+ {
+ entry->canceled = true;
+ }
+ }
+
+ /* Remove unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (syncsw[entry->handler].sync_filetagmatches(&(entry->ftag),
+ ftag /* predicate */, type))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == SYNC_UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->ftag = *ftag;
+ entry->handler = handler;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ ftag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->handler = handler;
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * RegisterSyncRequest()
+ *
+ * Register the sync request locally, or forward it to the checkpointer.
+ * The caller can choose to retry indefinitely, sleeping 10 ms between
+ * attempts, or to return immediately on error.
+ */
+bool
+RegisterSyncRequest(FileTag ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError)
+{
+ bool ret;
+
+ if (pendingOps != NULL)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(ftag, type, handler);
+ return true;
+ }
+ else
+ {
+ do
+ {
+ /*
+ * Notify the checkpointer about it. If we fail to queue the request
+ * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
+ * error would leave the no-longer-used file still present on disk,
+ * which would be bad, so I'm inclined to assume that the checkpointer
+ * will always empty the queue soon.
+ */
+ ret = ForwardSyncRequest(ftag, type, handler);
+ if (!ret && retryOnError)
+ pg_usleep(10000L);
+
+ } while (!ret && retryOnError);
+
+ Assert(ret || !retryOnError);
+ return ret;
+ }
+}
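For illustration, the two call patterns used by md.c above (fragments, with
"tag" as a placeholder already filled in):

    /* fsync request: best effort; on failure the caller syncs the file itself */
    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD, false))
        /* ... FileSync() the segment directly ... */ ;

    /* unlink request: must not be lost, so retry until it is queued */
    RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, SYNC_HANDLER_MD, true);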
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..0326e6c6ed 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -554,6 +555,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3c..76b60a36fc 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(FileTag ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb..40f46b871d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessSyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
new file mode 100644
index 0000000000..5d9e873586
--- /dev/null
+++ b/src/include/storage/md.h
@@ -0,0 +1,52 @@
+/*-------------------------------------------------------------------------
+ *
+ * md.h
+ * magnetic disk storage manager public interface declarations.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/md.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MD_H
+#define MD_H
+
+#include "fmgr.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
+
+/* md storage manager funcationality */
+extern void mdinit(void);
+extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
+extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber nblocks);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+
+/* md sync callback forward declarations */
+extern char* mdfilepath(FileTag ftag);
+extern bool mdfiletagmatches(FileTag ftag, FileTag predicate,
+ SyncRequestType type);
+
+#endif /* MD_H */
diff --git a/src/include/storage/segment.h b/src/include/storage/segment.h
new file mode 100644
index 0000000000..c7af945168
--- /dev/null
+++ b/src/include/storage/segment.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * segment.h
+ * POSTGRES disk segment definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/segment.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SEGMENT_H
+#define SEGMENT_H
+
+
+/*
+ * Segment Number:
+ *
+ * Each relation and its forks are divided into segments. This
+ * definition formalizes the definition of the segment number.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
+
+#endif /* SEGMENT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4e..26ac8f2cec 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,7 +18,6 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
-
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -106,43 +105,6 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
-/* internals: move me elsewhere -- ay 7/94 */
-
-/* in md.c */
-extern void mdinit(void);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
- BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 0000000000..1241eb40c9
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,81 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/segment.h"
+
+/*
+ * Caller specified type of sync request.
+ *
+ * SYNC_REQUESTs are issued to sync a particular file whose path is determined
+ * by calling back the handler. A SYNC_FORGET_REQUEST instructs the sync
+ * mechanism to cancel a previously submitted sync request.
+ *
+ * SYNC_FORGET_HIERARCHY_REQUEST is a special type of forget request that
+ * involves scanning all pending sync requests and cancelling any entry that
+ * matches. The entries are resolved by calling back the handler as the key is
+ * opaque to the sync mechanism. Handling these types of requests is a tad
+ * slow because we have to search all the requests linearly, but the
+ * operations that use this, such as dropping a database, are pretty
+ * heavyweight anyhow, so we'll live with it.
+ *
+ * SYNC_UNLINK_REQUEST is a request to delete the file after the next
+ * checkpoint. The path is determined by calling back the handler.
+ */
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ SYNC_FORGET_REQUEST,
+ SYNC_FORGET_HIERARCHY_REQUEST,
+ SYNC_UNLINK_REQUEST
+} SyncRequestType;
+
+/*
+ * Identifies the handler for the sync callbacks.
+ *
+ * These enums map back to entries in the callback function table, so the
+ * first value is explicitly set to 0. See sync.c for more information.
+ */
+typedef enum syncrequesthandler
+{
+ SYNC_HANDLER_MD = 0 /* md smgr */
+} SyncRequestHandler;
+
+/*
+ * Augmenting a relfilenode with the fork and segment number provides all
+ * the information to locate the particular segment of interest for a relation.
+ */
+typedef struct
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTagData;
+
+typedef FileTagData *FileTag;
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void SyncPreCheckpoint(void);
+extern void SyncPostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(FileTag ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void EnableSyncRequestForwarding(void);
+extern bool RegisterSyncRequest(FileTag ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError);
+
+#endif /* SYNC_H */
--
2.16.5
On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath <sdn@amazon.com> wrote:
[v11 patch]
Thanks. Hmm, something is wrong here because make check is
dramatically slower -- for example the "insert" test runs in ~8-13
seconds instead of the usual ~0.2 seconds according to Travis,
AppVeyor and my local FreeBSD system (note that fsync is disabled so
it's not that -- it must be bogus queue-related CPU?)
--
Thomas Munro
https://enterprisedb.com
On Tue, Mar 05, 2019 at 10:45:37PM +1300, Thomas Munro wrote:
On Tue, Mar 5, 2019 at 2:25 PM Shawn Debnath <sdn@amazon.com> wrote:
[v11 patch]
Thanks. Hmm, something is wrong here because make check is
dramatically slower -- for example the "insert" test runs in ~8-13
seconds instead of the usual ~0.2 seconds according to Travis,
AppVeyor and my local FreeBSD system (note that fsync is disabled so
it's not that -- it must be bogus queue-related CPU?)
Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test.
Interesting! It's reproducible so should be able to figure out what's
going on. The only thing we do in ForwardSyncRequest() is split up the 8
bits into 2x4 bits and copy the FileTagData structure to the
checkpointer queue. Will report back what I find.
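For reference, the bit manipulation in question boils down to this (the
macros from checkpointer.c in the patch, with the mask variables inlined):

/* handler in the high 4 bits, type in the low 4 bits of one uint8 */
#define SYNC_TYPE_AND_HANDLER_COMBO(t, h)   ((h) << 4 | (t))
#define SYNC_REQUEST_TYPE_VALUE(v)          ((v) & 0x0F)
#define SYNC_REQUEST_HANDLER_VALUE(v)       (((v) & 0xF0) >> 4)

That is far too cheap to account for a regression measured in seconds, so
the cause must be elsewhere.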
--
Shawn Debnath
Amazon Web Services (AWS)
On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath <sdn@amazon.com> wrote:
Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test.
Interesting! It's reproducible so should be able to figure out what's
going on. The only thing we do in ForwardSyncRequest() is split up the 8
bits into 2x4 bits and copy the FileTagData structure to the
checkpointer queue. Will report back what I find.
More review, all superficial stuff:
+typedef struct
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTagData;
+
+typedef FileTagData *FileTag;
Even though I know I said we should take FileTag by pointer, and even
though there is an older tradition in the tree of having a struct
named "FooData" and a corresponding pointer typedef named "Foo", as
far as I know most people are not following the convention for new
code and I for one don't like it. One problem is that there isn't a
way to make a pointer-to-const type given a pointer-to-non-const type,
so you finish up throwing away const from your programs. I like const
as documentation and a tiny bit of extra compiler checking. What do
you think about "FileTag" for the struct and eg "const FileTag *tag"
when receiving one as a function argument?
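To spell out the const problem, a reduced example (one hypothetical member
for brevity):

typedef struct { int segno; } FileTagData;
typedef FileTagData *FileTag;

void f(const FileTag tag)       /* means FileTagData *const tag ... */
{
    tag->segno = 0;             /* ... so this still compiles */
}

void g(const FileTagData *tag)  /* pointer to const data */
{
    /* tag->segno = 0;             error: assignment to read-only location */
}

There is no way to spell g()'s parameter type using only the FileTag
typedef.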
-/* internals: move me elsewhere -- ay 7/94 */
Aha, about time too!
+#include "fmgr.h"
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
Why do we need to include fmgr.h in md.h?
+/* md storage manager funcationality */
Typo.
+/* md sync callback forward declarations */
These aren't "forward" declarations, they're plain old declarations.
+extern char* mdfilepath(FileTag ftag);
Doesn't really matter too much because all of this will get
pgindent-ed at some point, but FYI we write "char *md", not "char*
md".
#include "storage/smgr.h"
+#include "storage/md.h"
#include "utils/hsearch.h"
Bad sorting.
+ FileTagData tag;
+ tag.rnode = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
I wonder if it would be better practice to zero-initialise that
sucker, so that if more members are added we don't leave them
uninitialised. I like the syntax "FileTagData tag = {{0}}".
(Unfortunately extra nesting required here because first member is a
struct, and C99 doesn't allow us to use empty {} like C++, even though
some versions of GCC accept it. Rats.)
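Concretely, in register_dirty_segment that would be something like
(sketch):

FileTagData tag = {{0}};    /* nested braces because rnode is a struct */

tag.rnode = reln->smgr_rnode.node;
tag.forknum = forknum;
tag.segno = seg->mdfd_segno;

Any member added to FileTagData later would then start out zeroed instead
of holding stack garbage.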
--
Thomas Munro
https://enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
+#include "fmgr.h" +#include "storage/block.h" +#include "storage/relfilenode.h" +#include "storage/smgr.h" +#include "storage/sync.h"
Why do we need to include fmgr.h in md.h?
More generally, any massive increase in an include file's inclusions
is probably a sign that you need to refactor. Cross-header inclusions
are best avoided altogether if you can --- obviously that's not always
possible, but we should minimize them. We've had some very unfortunate
problems in the past from indiscriminate #includes in headers.
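(Where a header needs only a type name rather than its full definition, a
forward declaration avoids the include entirely -- a generic sketch, not
necessarily viable for everything md.h pulls in:)

struct SMgrRelationData;        /* real definition stays in smgr.h */
extern void some_md_op(struct SMgrRelationData *reln);  /* hypothetical */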
regards, tom lane
On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
+#include "fmgr.h" +#include "storage/block.h" +#include "storage/relfilenode.h" +#include "storage/smgr.h" +#include "storage/sync.h"Why do we need to include fmgr.h in md.h?
More generally, any massive increase in an include file's inclusions
is probably a sign that you need to refactor. Cross-header inclusions
are best avoided altogether if you can --- obviously that's not always
possible, but we should minimize them. We've had some very unfortunate
problems in the past from indiscriminate #includes in headers.
Agree - I do pay attention to these, but this one slipped through the
cracks (copied smgr.h then edited to remove smgr bits). Thanks for
catching this, will fix in the next patch iteration.
--
Shawn Debnath
Amazon Web Services (AWS)
On Wed, Mar 06, 2019 at 05:33:54AM +1300, Thomas Munro wrote:
On Wed, Mar 6, 2019 at 5:07 AM Shawn Debnath <sdn@amazon.com> wrote:
Confirmed. Patch shows 8900 ms vs 192 ms on master for the insert test.
Interesting! It's reproducible so should be able to figure out what's
going on. The only thing we do in ForwardSyncRequest() is split up the 8
bits into 2x4 bits and copy the FileTagData structure to the
checkpointer queue. Will report back what I find.
Fixed - tried to be clever with a do while loop and ended up forcing a
sleep of 10 ms for every register request. Silly mistake (had an assert
with the right assertion right after!). Reverted to while(1) with a
clean break if we meet the conditions. Thanks to Thomas for being a
second set of eyes during the investigation. make check runs are happy:
patch:
make check 2.48s user 0.94s system 12% cpu 27.411 total
master:
make check 2.50s user 0.88s system 12% cpu 27.573 total
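The fixed loop is shaped roughly like this (sketch; names as in the v12
patch below):

while (1)
{
    ret = ForwardSyncRequest(ftag, type, handler);
    if (ret || !retryOnError)
        break;              /* queued successfully, or caller won't retry */

    pg_usleep(10000L);      /* sleep only after a failed attempt */
}

so the 10 ms sleep now happens only when the request queue was actually
full.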
More review, all superficial stuff:
+typedef struct
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTagData;
+
+typedef FileTagData *FileTag;
Even though I know I said we should take FileTag by pointer, and even
though there is an older tradition in the tree of having a struct
named "FooData" and a corresponding pointer typedef named "Foo", as
far as I know most people are not following the convention for new
code and I for one don't like it. One problem is that there isn't a
way to make a pointer-to-const type given a pointer-to-non-const type,
so you finish up throwing away const from your programs. I like const
as documentation and a tiny bit of extra compiler checking. What do
you think about "FileTag" for the struct and eg "const FileTag *tag"
when receiving one as a function argument?
More compile time safety checks are always better. I have made the
changes in this new patch. Also, followed BufferTag pattern and defined
INIT_FILETAG that will initialize the structure with the correct values.
This avoids the point you bring up of accidentally omitting initializing
members when new ones are added.
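Presumably something along these lines (hypothetical sketch; the real
definition is in sync.h in the attached patch, not quoted in full below):

#define INIT_FILETAG(a, xx_rnode, xx_forknum, xx_segno) \
( \
    memset(&(a), 0, sizeof(FileTag)), \
    (a).rnode = (xx_rnode), \
    (a).forknum = (xx_forknum), \
    (a).segno = (xx_segno) \
)

Zeroing the whole struct first also keeps any padding bytes deterministic,
which matters when the tag is used as a hash table key.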
+#include "fmgr.h" +#include "storage/block.h" +#include "storage/relfilenode.h" +#include "storage/smgr.h" +#include "storage/sync.h"Why do we need to include fmgr.h in md.h?
Removed.
+/* md storage manager funcationality */
Typo.
Fixed
+/* md sync callback forward declarations */
These aren't "forward" declarations, they're plain old declarations.
Removed and simplified.
+extern char* mdfilepath(FileTag ftag);
Doesn't really matter too much because all of this will get
pgindent-ed at some point, but FYI we write "char *md", not "char*
md".
Hmm - different, noted. Changed.
#include "storage/smgr.h"
+#include "storage/md.h"
#include "utils/hsearch.h"Bad sorting.
Ordered correctly.
+ FileTagData tag;
+ tag.rnode = reln->smgr_rnode.node;
+ tag.forknum = forknum;
+ tag.segno = seg->mdfd_segno;
I wonder if it would be better practice to zero-initialise that
sucker, so that if more members are added we don't leave them
uninitialised. I like the syntax "FileTagData tag = {{0}}".
(Unfortunately extra nesting required here because first member is a
struct, and C99 doesn't allow us to use empty {} like C++, even though
some versions of GCC accept it. Rats.)
See comments above for re-defining FileTag.
--
Shawn Debnath
Amazon Web Services (AWS)
Attachments:
0001-Refactor-the-fsync-machinery-to-support-future-SMGR-v12.patch (text/plain; charset=us-ascii)
From ebaf6feaf0530fb0eace516bb1c8487b5ef9fa7f Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH] Refactor the fsync mechanism to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimised for different usage patterns:
1. Move the system for requesting and processing fsyncs out of md.c
into storage/sync/sync.c with definitions in include/storage/sync.h.
ProcessSyncRequests() is now responsible for processing the sync
requests during checkpoint.
2. Removed the need for specific storage managers to implement pre and
post checkpoint callbacks. These are now executed by the sync mechanism.
3. We now embed the fork number and the segment number as part of the
hash key for the pending ops table. This eliminates the bitmapset based
segment tracking for each relfilenode during fsync as not all storage
managers may map their segments from zero.
4. Each sync request now must include a type: sync, forget, forget
hierarchy, or unlink, and the owner who will be responsible for
generating paths or matching forget requests.
5. For cancelling relation sync requests, we now must send a forget
request for each fork and segment in the relation.
6. We do not rely on smgr to provide the file descriptor we use to
issue fsync. Instead, we generate the full path based on the FileTag
in the sync request and use PathNameOpenFile to get the file descriptor.
Author: Shawn Debnath, Thomas Munro
Reviewed-by:
Discussion:
https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/twophase.c | 1 +
src/backend/access/transam/xact.c | 1 +
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 7 +-
src/backend/postmaster/checkpointer.c | 64 ++-
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 846 ++++------------------------------
src/backend/storage/smgr/smgr.c | 55 +--
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 638 +++++++++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/md.h | 51 ++
src/include/storage/segment.h | 28 ++
src/include/storage/smgr.h | 38 --
src/include/storage/sync.h | 86 ++++
18 files changed, 988 insertions(+), 877 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/md.h
create mode 100644 src/include/storage/segment.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 64679dd2de..80150467c7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -98,6 +98,7 @@
#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e93262975d..5384f62b34 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -50,6 +50,7 @@
#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ecd12fc53a..b2b154e77a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -65,6 +65,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6986,7 +6987,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8616,7 +8617,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8912,7 +8913,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ SyncPostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index d207cd899f..d553e2087c 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -53,6 +53,7 @@
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
@@ -940,11 +941,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2149,7 +2150,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41359..7529ea4bba 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -108,12 +108,38 @@
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ /*
+ * To reduce memory footprint, we combine the SyncRequestType and the
+ * SyncRequestHandler by splitting them into 4 bits each and storing them
+ * in a uint8. The type and handler values account for far fewer than
+ * 15 entries, so this works just fine.
+ */
+ uint8 sync_type_handler_combo;
+
+ /*
+ * Currently, sync requests can be satisfied by information available in
+ * the FileTag. In the future, this can be combined with a physical
+ * file descriptor or the full path to a file and put inside a union.
+ *
+ * This value is opaque to the sync mechanism and is passed to the
+ * callback handlers to retrieve the path of the file to sync or to
+ * resolve forget requests.
+ */
+ FileTag ftag;
} CheckpointerRequest;
+/*
+ * Handler occupies the higher 4 bits while type occupies the lower 4 in
+ * the uint8 combo storage.
+ */
+static uint8 sync_request_type_mask = 0x0F;
+static uint8 sync_request_handler_mask = 0xF0;
+
+#define SYNC_TYPE_AND_HANDLER_COMBO(t, h) ((h) << 4 | (t))
+#define SYNC_REQUEST_TYPE_VALUE(v) (sync_request_type_mask & (v))
+#define SYNC_REQUEST_HANDLER_VALUE(v) ((sync_request_handler_mask & (v)) >> 4)
+
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -347,7 +373,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -676,7 +702,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -701,7 +727,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -1063,7 +1089,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1092,10 +1118,11 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler)
{
CheckpointerRequest *request;
- bool too_full;
+ bool too_full;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1130,9 +1157,8 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->sync_type_handler_combo = SYNC_TYPE_AND_HANDLER_COMBO(type, handler);
+ request->ftag = *ftag;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1169,7 +1195,7 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int slot;
};
int n,
@@ -1263,8 +1289,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1272,7 +1298,7 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
@@ -1314,7 +1340,9 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(&(request->ftag),
+ SYNC_REQUEST_TYPE_VALUE(request->sync_type_handler_combo),
+ SYNC_REQUEST_HANDLER_VALUE(request->sync_type_handler_combo));
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6e..8376cdfca2 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385f..887023fc8a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe91..8cc9fb1614 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,18 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/relfilenode.h"
+#include "storage/segment.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,53 +87,30 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+
/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
+ * Segment handling behaviors
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
-
-
-/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL (1 << 0)
/* return NULL if segment not present */
@@ -179,26 +129,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +138,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +252,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -413,6 +275,11 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
+ /* First, forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, 0 /* first seg */);
+
+ /* Next unlink the file */
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
@@ -442,7 +309,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ register_unlink_segment(rnode, forkNum, 0 /* first seg */);
}
/*
@@ -459,6 +326,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /* Forget any pending sync requests for the segment before we unlink */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1004,385 +875,51 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
+ * mdfilepath()
+ *
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
*/
-void
-mdsync(void)
+char *
+mdfilepath(const FileTag *ftag)
{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
+ char *path,
+ *fullpath;
/*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
+ * We can safely pass InvalidBackendId as we never expect to sync
+ * any segments for temporary relations.
*/
- AbsorbFsyncRequests();
+ path = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, ftag->forknum);
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
+ if (ftag->segno > 0 && ftag->segno != InvalidSegmentNumber)
{
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
+ fullpath = psprintf("%s.%u", path, ftag->segno);
+ pfree(path);
}
+ else
+ fullpath = path;
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
+ return fullpath;
}
/*
- * mdpostckpt() -- Do post-checkpoint work
+ * mdfiletagmatches()
*
- * Remove any lingering files that can now be safely removed.
+ * Returns true if the predicate tag matches with the file tag.
*/
-void
-mdpostckpt(void)
+bool
+mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
+ SyncRequestType type)
{
- int absorb_counter;
+ /* Today, we only do matching for hierarchy (forget database) requests */
+ Assert(type == SYNC_FORGET_HIERARCHY_REQUEST);
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
+ if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ return ftag->rnode.dbNode == predicate->rnode.dbNode;
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return false;
}
/*
@@ -1397,19 +934,16 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTag tag;
+
+ INIT_FILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD,
+ false /*retryOnError*/))
{
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1423,254 +957,54 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/*
* register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
*/
static void
-register_unlink(RelFileNodeBackend rnode)
+register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
+ FileTag tag;
+
+ INIT_FILETAG(tag, rnode.node, forknum, segno);
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
+ * register_forget_request() -- forget any fsyncs for a relation fork's segment
*/
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+static void
+register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- Assert(pendingOpsTable);
+ FileTag tag;
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+ INIT_FILETAG(tag, rnode.node, forknum, segno);
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
* ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
+ FileTag tag;
RelFileNode rnode;
rnode.dbNode = dbid;
rnode.spcNode = 0;
rnode.relNode = 0;
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ INIT_FILETAG(tag, rnode, InvalidForkNumber, InvalidSegmentNumber);
+
+ RegisterSyncRequest(&tag, SYNC_FORGET_HIERARCHY_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab3..39f4fed25e 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
#include "commands/tablespace.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -59,12 +60,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -82,15 +79,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -751,52 +744,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 0000000000..cfc60cadb4
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 0000000000..5f7db69e8a
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,638 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+static MemoryContext pendingOpsCxt; /* context for the pending ops state */
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ FileTag ftag; /* hash table key (must be first!) */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* canceled is true if we canceled "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ FileTag ftag; /* tag for relation file to delete */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * This struct of function pointers defines the API between sync.c and
+ * any component that wants its files fsync'd at checkpoint time. Each
+ * handler must be able to convert a FileTag into a pathname, and to
+ * decide whether a FileTag matches a given "forget" predicate.
+ */
+typedef struct f_sync
+{
+ char* (*sync_filepath) (const FileTag *ftag);
+ bool (*sync_filetagmatches) (const FileTag *ftag,
+ const FileTag *predicate, SyncRequestType type);
+} f_sync;
+
+static const f_sync syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_filepath = mdfilepath,
+ .sync_filetagmatches = mdfiletagmatches
+ }
+};
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * SyncPreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+SyncPreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * SyncPostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+SyncPostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+ * requests for a long time when there are many deletions to be done.
+ * We can safely call AbsorbSyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps table.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessSyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If fsync is off then we don't have to bother opening the
+ * file at all. (We delay checking until this point so that
+ * changing fsync on the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ continue;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync-request bits, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments
+ * that have been deleted (unlinked) by the time we get to
+ * them. Rather than just hoping an ENOENT (or EACCES on
+ * Windows) error can be ignored, what we do on error is
+ * absorb pending requests and then retry. Since mdunlink()
+ * queues a "cancel" message before actually unlinking, the
+ * fsync request is guaranteed to be marked canceled after the
+ * absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts
+ * deletions.
+ *
+ * If the entry was canceled after the absorb above, or within the
+ * absorb inside the loop, exit the loop and delete the entry right
+ * after. The loop can also exit via the "break" after a successful
+ * sync.
+ */
+ for (failures = 0; !(entry->canceled); failures++)
+ {
+ char *path;
+ File fd;
+
+ path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (fd >= 0 &&
+ FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ FileClose(fd);
+ pfree(path);
+ break; /* out of retry loop */
+ }
+
+ /* Done with the file descriptor, close it */
+ if (fd >= 0)
+ FileClose(fd);
+
+ /*
+ * It is possible that the relation has been dropped or
+ * truncated since the fsync request was entered.
+ * Therefore, allow ENOENT, but only if we didn't fail
+ * already on this file. This applies both for
+ * PathNameOpenFile() and for FileSync(), since fd.c might
+ * have closed the file behind our back.
+ *
+ * XXX is there any point in allowing more than one retry?
+ * Don't see one at the moment, but easy to change the
+ * test here if so.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ pfree(path);
+
+ /*
+ * Absorb incoming requests and check to see if a cancel
+ * arrived for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->ftag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(const FileTag *ftag, SyncRequestType type, SyncRequestHandler handler)
+{
+ Assert(pendingOps);
+
+ if (type == SYNC_FORGET_REQUEST)
+ {
+ PendingFsyncEntry *entry;
+ /* Cancel previously entered request */
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *)ftag,
+ HASH_FIND,
+ NULL);
+ if (entry != NULL)
+ entry->canceled = true;
+ }
+ else if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ {
+ /* Remove any pending requests for the entire database */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (syncsw[entry->handler].sync_filetagmatches(&(entry->ftag),
+ ftag /* predicate */, type))
+ {
+ entry->canceled = true;
+ }
+ }
+
+ /* Remove unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (syncsw[entry->handler].sync_filetagmatches(&(entry->ftag),
+ ftag /* predicate */, type))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == SYNC_UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->ftag = *ftag;
+ entry->handler = handler;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ ftag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->handler = handler;
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * RegisterSyncRequest()
+ *
+ * Register the sync request locally, or forward it to the checkpointer.
+ * The caller can choose to retry indefinitely, or to return immediately
+ * on error. We currently wait 10 ms between retries.
+ */
+bool
+RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError)
+{
+ bool ret;
+
+ if (pendingOps != NULL)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(ftag, type, handler);
+ return true;
+ }
+ else
+ {
+ while(1)
+ {
+ /*
+ * Notify the checkpointer about it. If we fail to queue the
+ * request, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
+ * error would leave the no-longer-used file still present on disk,
+ * which would be bad, so I'm inclined to assume that the checkpointer
+ * will always empty the queue soon.
+ */
+ ret = ForwardSyncRequest(ftag, type, handler);
+
+ /*
+ * If we successfully queued the request, or we failed and were
+ * instructed not to retry on error, break.
+ */
+ if (ret || !retryOnError)
+ break;
+
+ pg_usleep(10000L);
+ }
+
+ return ret;
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a5ee209f91..0326e6c6ed 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -50,6 +50,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -554,6 +555,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3c..40b05d4661 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb..40f46b871d 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessSyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
new file mode 100644
index 0000000000..fc13e34a6f
--- /dev/null
+++ b/src/include/storage/md.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * md.h
+ * magnetic disk storage manager public interface declarations.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/md.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MD_H
+#define MD_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
+
+/* md storage manager functionality */
+extern void mdinit(void);
+extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
+extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber nblocks);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+
+/* md sync callbacks */
+extern char *mdfilepath(const FileTag *ftag);
+extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
+ SyncRequestType type);
+
+#endif /* MD_H */
diff --git a/src/include/storage/segment.h b/src/include/storage/segment.h
new file mode 100644
index 0000000000..c7af945168
--- /dev/null
+++ b/src/include/storage/segment.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * segment.h
+ * POSTGRES disk segment definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/segment.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SEGMENT_H
+#define SEGMENT_H
+
+
+/*
+ * Segment Number:
+ *
+ * Each fork of a relation is divided into segments. This typedef
+ * formalizes the type used for segment numbers.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
+
+#endif /* SEGMENT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08ed4e..26ac8f2cec 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,7 +18,6 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
-
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -106,43 +105,6 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
-/* internals: move me elsewhere -- ay 7/94 */
-
-/* in md.c */
-extern void mdinit(void);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
- BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 0000000000..11a0f01d42
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/segment.h"
+
+/*
+ * Caller specified type of sync request.
+ *
+ * SYNC_REQUESTs are issued to sync a particular file whose path is determined
+ * by calling back the handler. A SYNC_FORGET_REQUEST instructs the sync
+ * mechanism to cancel a previously submitted sync request.
+ *
+ * SYNC_FORGET_HIERARCHY_REQUEST is a special type of forget request that
+ * involves scanning all pending sync requests and cancelling any entry that
+ * matches. The entries are resolved by calling back the handler as the key is
+ * opaque to the sync mechanism. Handling these requests is a tad slow
+ * because we have to search all the requests linearly, but the operations
+ * that use them, such as dropping a database, are pretty heavyweight
+ * anyhow, so we'll live with it.
+ *
+ * SYNC_UNLINK_REQUEST is a request to delete the file after the next
+ * checkpoint. The path is determined by calling back the handler.
+ */
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ SYNC_FORGET_REQUEST,
+ SYNC_FORGET_HIERARCHY_REQUEST,
+ SYNC_UNLINK_REQUEST
+} SyncRequestType;
+
+/*
+ * Identifies the handler for the sync callbacks.
+ *
+ * These values index into the callback function table in sync.c, so the
+ * first value is explicitly set to 0. See sync.c for more information.
+ */
+typedef enum syncrequesthandler
+{
+ SYNC_HANDLER_MD = 0 /* md smgr */
+} SyncRequestHandler;
+
+/*
+ * Augmenting a relfilenode with the fork and segment number provides all
+ * the information needed to locate the particular segment of a relation.
+ */
+typedef struct filetag
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTag;
+
+#define INIT_FILETAG(a,xx_rnode,xx_forknum,xx_segno) \
+( \
+ (a).rnode = (xx_rnode), \
+ (a).forknum = (xx_forknum), \
+ (a).segno = (xx_segno) \
+)
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void SyncPreCheckpoint(void);
+extern void SyncPostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void EnableSyncRequestForwarding(void);
+extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError);
+
+#endif /* SYNC_H */
--
2.16.5
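To make the new interface concrete for reviewers: with the patch above, a
component queues work by filling in a FileTag and calling
RegisterSyncRequest(). Here is a minimal sketch, closely mirroring
register_dirty_segment() in the patch; SYNC_HANDLER_FOO and
foo_register_dirty_segment() are invented names for illustration, not part
of the patch.

#include "postgres.h"

#include "storage/sync.h"

/*
 * Hypothetical sketch: queue an fsync request for one segment of a "foo"
 * storage manager.  A real implementation would add SYNC_HANDLER_FOO to
 * SyncRequestHandler and register its callbacks in sync.c's table.
 */
static void
foo_register_dirty_segment(RelFileNode rnode, ForkNumber forknum,
                           SegmentNumber segno)
{
    FileTag     tag;

    INIT_FILETAG(tag, rnode, forknum, segno);

    /* Hand the request to the checkpointer (or remember it locally). */
    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_FOO,
                             false /* retryOnError */ ))
    {
        /*
         * The queue was full and we chose not to retry; a real
         * implementation would fsync the file itself here, as md.c does.
         */
        ereport(DEBUG1,
                (errmsg("could not forward fsync request because request queue is full")));
    }
}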
On Wed, Mar 6, 2019 at 6:16 AM Shawn Debnath <sdn@amazon.com> wrote:
On Tue, Mar 05, 2019 at 11:53:16AM -0500, Tom Lane wrote:
Thomas Munro <thomas.munro@gmail.com> writes:
Why do we need to include fmgr.h in md.h?
More generally, any massive increase in an include file's inclusions
is probably a sign that you need to refactor. Cross-header inclusions
are best avoided altogether if you can --- obviously that's not always
possible, but we should minimize them. We've had some very unfortunate
problems in the past from indiscriminate #includes in headers.

Agree - I do pay attention to these, but this one slipped through the
cracks (copied smgr.h then edited to remove smgr bits). Thanks for
catching this, will fix in the next patch iteration.
Huh... so why was it in smgr.h then? Seems bogus. Fix pushed to master.
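(As a generic illustration of that advice, not code from the patch: a
header that only needs a pointer type can use a forward declaration
instead of pulling in the whole other header, so includers don't
transitively parse it.)

/* hypothetical header: avoid #include "storage/smgr.h" here ... */
struct SMgrRelationData;        /* forward declaration suffices */

extern void example_close(struct SMgrRelationData *reln);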
--
Thomas Munro
https://enterprisedb.com
On Thu, Mar 07, 2019 at 08:11:51PM +1300, Thomas Munro wrote:
Huh... so why was it in smgr.h then? Seems bogus. Fix pushed to master.
So ... wondering if there are any other left over items for this patch
or is it good to go? I imagine there's at least a couple of us who would
love to see this get in for PG12.
Thanks!
--
Shawn Debnath
Amazon Web Services (AWS)
On Wed, Mar 13, 2019 at 2:00 PM Shawn Debnath <sdn@amazon.com> wrote:
So ... wondering if there are any other left over items for this patch
or is it good to go? I imagine there's at least a couple of us who would
love to see this get in for PG12.
I rebased my WIP undo stuff[1] (targeting 13) on top of this, and that
seemed to go smoothly and the interfaces made sense, which was
reassuring. I do wonder if we'll need to expose a way for eg
pg_rewind and perhaps external backup tools to find paths and offsets
given WAL block references that might in future include an SMGR ID
(well that's one proposal), but that handwavy requirement doesn't seem
to conflict with anything we've done here. I'm planning to do another
round of review and testing. Aside from some refactoring which I
think looks good anyway and prepares for future patches, the main
effect of this patch is to force the checkpointer to open and close
the files every time which seems OK to me. I know Andres wants to
make a pass through it too.
[1]: /messages/by-id/CA+hUKGKN42jB+ubCKru716HPtMbahdia39GwG5pLgWLMZ_y1ng@mail.gmail.com
--
Thomas Munro
https://enterprisedb.com
On Wed, Mar 13, 2019 at 2:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
[...] Aside from some refactoring which I
think looks good anyway and prepares for future patches, the main
effect of this patch is to force the checkpointer to open and close
the files every time which seems OK to me.
I've been trying to decide if that is a problem. Maybe there is a
performance angle, and I'm also wondering if it might increase the
risk of missing a write-back error. Of course we'll find a proper
solution to that problem (perhaps fd-passing or sync-on-close[1]), but
I don't want to commit anything in the name of refactoring that might
make matters worse incidentally. Or perhaps those worries are bogus,
since the checkpointer calls smgrcloseall() at the end anyway.
On balance, I'm inclined to err on the side of caution and try to keep
things a bit closer to the way they are for now.
Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002
has my proposed changes to switch to callbacks that actually perform
the sync and unlink operations given a file tag, and do so via the
SMGR fd cache, rather than exposing the path to sync.c. This moves us
back towards the way I had it in an earlier version of the patch, but
instead of using smgrsyncimmed() as I had it, it goes via Shawn's new
sync handler function lookup table, allowing for non-smgr components
to use this machinery in future (as requested by Andres).
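To give a feel for the direction, here is a rough sketch of what such a
callback table might look like (the names and signatures are my guesses
for illustration; see the attached 0002 for the actual definitions):

/*
 * Instead of returning a path for sync.c to open, each handler performs
 * the operation itself, so md.c can go through its own fd cache.  The
 * "path" argument is an output buffer used only for error reporting.
 */
typedef struct f_sync
{
    /* fsync the file, returning 0 on success or -1 with errno set */
    int         (*sync_syncfiletag) (const FileTag *ftag, char *path);
    /* unlink the file after a checkpoint, same return convention */
    int         (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
    /* does this entry match the given "forget hierarchy" predicate? */
    bool        (*sync_filetagmatches) (const FileTag *ftag,
                                        const FileTag *predicate);
} f_sync;

That keeps pathnames out of sync.c entirely and lets md.c reuse its
existing segment lookup and fd cache when performing the fsync.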
Thoughts?
It all needs to be pgindented, I'll do that later. I'll post a rebase
of my undo stuff on top of this soon, to show how it looks with this
interface.
[1]: /messages/by-id/CA+hUKGLMPXMnSLDwgnNRFPyxvf_0bJ5HwXcHWjCp7Cvh7G=xEA@mail.gmail.com
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Refactor-the-fsync-mechanism-to-support-future-S-v13.patch (application/octet-stream)
From cd78a2be9830fb8a29c2f8bb516d2fa7728aecc4 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH 1/2] Refactor the fsync mechanism to support future SMGR
implementations.
In anticipation of proposed block storage managers alongside md.c that
map bufmgr.c blocks to files optimized for different usage patterns:
1. Move the system for requesting and processing fsyncs out of md.c
into storage/sync/sync.c with definitions in include/storage/sync.h.
ProcessSyncRequests() is now responsible for processing the sync
requests during checkpoint.
2. Removed the need for specific storage managers to implement pre and
post checkpoint callbacks. These are now executed by the sync mechanism.
3. We now embed the fork number and the segment number as part of the
hash key for the pending ops table. This eliminates the bitmapset-based
segment tracking for each relfilenode during fsync, since not all storage
managers may map their segments from zero.
4. Each sync request now must include a type: sync, forget, forget
hierarchy, or unlink, and the owner who will be responsible for
generating paths or matching forget requests.
5. For cancelling relation sync requests, we now must send a forget
request for each fork and segment in the relation.
6. We do not rely on smgr to provide the file descriptor we use to
issue fsync. Instead, we generate the full path based on the FileTag
in the sync request and use PathNameOpenFile to get the file descriptor.
Author: Shawn Debnath, Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/twophase.c | 1 +
src/backend/access/transam/xact.c | 1 +
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 7 +-
src/backend/postmaster/checkpointer.c | 64 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 846 +++-----------------------
src/backend/storage/smgr/smgr.c | 55 +-
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 638 +++++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/md.h | 51 ++
src/include/storage/segment.h | 28 +
src/include/storage/smgr.h | 38 --
src/include/storage/sync.h | 86 +++
18 files changed, 988 insertions(+), 877 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/md.h
create mode 100644 src/include/storage/segment.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 21986e48fe2..17d779fcae4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -98,6 +98,7 @@
#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index c3214d4f4d8..e6f7bbcbfe6 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -50,6 +50,7 @@
#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ad12ebc4269..06a13e6dd1d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6946,7 +6947,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8576,7 +8577,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8872,7 +8873,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ SyncPostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 35cad0b6294..9707afabd98 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -54,6 +54,7 @@
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
@@ -941,11 +942,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2150,7 +2151,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c2411081a5e..834da553177 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -108,12 +108,38 @@
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ /*
+ * To reduce memory footprint, we combine the SyncRequestType and the
+ * SyncRequestHandler by packing them into 4 bits each and storing them
+ * in a uint8. The type and handler enums each have far fewer than 16
+ * values, so this works just fine.
+ */
+ uint8 sync_type_handler_combo;
+
+ /*
+ * Currently, sync requests can be satisfied by information available in
+ * the FileTag. In the future, this could be combined with a physical
+ * file descriptor or the full path to a file and put inside a union.
+ *
+ * This value is opaque to the sync mechanism and is passed to the
+ * callback handlers to retrieve the path of the file to sync or to
+ * resolve forget requests.
+ */
+ FileTag ftag;
} CheckpointerRequest;
+/*
+ * Handler occupies the higher 4 bits while type occupies the lower 4 in
+ * the uint8 combo storage.
+ */
+static uint8 sync_request_type_mask = 0x0F;
+static uint8 sync_request_handler_mask = 0xF0;
+
+#define SYNC_TYPE_AND_HANDLER_COMBO(t, h) ((h) << 4 | (t))
+#define SYNC_REQUEST_TYPE_VALUE(v) (sync_request_type_mask & (v))
+#define SYNC_REQUEST_HANDLER_VALUE(v) ((sync_request_handler_mask & (v)) >> 4)
+
typedef struct
{
pid_t checkpointer_pid; /* PID (0 if not started) */
@@ -349,7 +375,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -684,7 +710,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -709,7 +735,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -1084,7 +1110,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1113,10 +1139,11 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler)
{
CheckpointerRequest *request;
- bool too_full;
+ bool too_full;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1151,9 +1178,8 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->sync_type_handler_combo = SYNC_TYPE_AND_HANDLER_COMBO(type, handler);
+ request->ftag = *ftag;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1190,7 +1216,7 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int slot;
};
int n,
@@ -1284,8 +1310,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1293,7 +1319,7 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
@@ -1335,7 +1361,9 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(&(request->ftag),
+ SYNC_REQUEST_TYPE_VALUE(request->sync_type_handler_combo),
+ SYNC_REQUEST_HANDLER_VALUE(request->sync_type_handler_combo));
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6ea..8376cdfca20 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385fe..887023fc8a5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2aba2dfe917..8cc9fb16148 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,18 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/relfilenode.h"
+#include "storage/segment.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,53 +87,30 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+
/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
+ * Segment handling behaviors
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
-
-
-/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL (1 << 0)
/* return NULL if segment not present */
@@ -179,26 +129,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +138,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +252,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -413,6 +275,11 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
+ /* First, forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, 0 /* first seg */);
+
+ /* Next unlink the file */
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
@@ -442,7 +309,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ register_unlink_segment(rnode, forkNum, 0 /* first seg */);
}
/*
@@ -459,6 +326,10 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /* Forget any pending sync requests for the segment before we unlink */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1004,385 +875,51 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
/*
- * mdsync() -- Sync previous writes to stable storage.
+ * mdfilepath()
+ *
+ * Return the filename for the specified segment of the relation. The
+ * returned string is palloc'd.
*/
-void
-mdsync(void)
+char *
+mdfilepath(const FileTag *ftag)
{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
+ char *path,
+ *fullpath;
/*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
+ * We can safely pass InvalidBackendId as we never expect to sync
+ * any segments for temporary relations.
*/
- AbsorbFsyncRequests();
+ path = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
+ ftag->rnode.relNode, InvalidBackendId, ftag->forknum);
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
+ if (ftag->segno > 0 && ftag->segno != InvalidSegmentNumber)
{
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
+ fullpath = psprintf("%s.%u", path, ftag->segno);
+ pfree(path);
}
+ else
+ fullpath = path;
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
+ return fullpath;
}
/*
- * mdpostckpt() -- Do post-checkpoint work
+ * mdfiletagmatches()
*
- * Remove any lingering files that can now be safely removed.
+ * Returns true if the predicate tag matches the file tag.
*/
-void
-mdpostckpt(void)
+bool
+mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
+ SyncRequestType type)
{
- int absorb_counter;
+ /* Today, we only do matching for hierarchy (forget database) requests */
+ Assert(type == SYNC_FORGET_HIERARCHY_REQUEST);
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
+ if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ return ftag->rnode.dbNode == predicate->rnode.dbNode;
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
+ return false;
}
/*
@@ -1397,19 +934,16 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTag tag;
+
+ INIT_FILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD,
+ false /*retryOnError*/))
{
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1423,254 +957,54 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
+ * register_unlink_segment() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
*/
static void
-register_unlink(RelFileNodeBackend rnode)
+register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
+ FileTag tag;
+
+ INIT_FILETAG(tag, rnode.node, forknum, segno);
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
+ * register_forget_request() -- forget any fsyncs for a relation fork's segment
*/
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+static void
+register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- Assert(pendingOpsTable);
+ FileTag tag;
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
-
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+ INIT_FILETAG(tag, rnode.node, forknum, segno);
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
+ FileTag tag;
RelFileNode rnode;
rnode.dbNode = dbid;
rnode.spcNode = 0;
rnode.relNode = 0;
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ INIT_FILETAG(tag, rnode, InvalidForkNumber, InvalidSegmentNumber);
+
+ RegisterSyncRequest(&tag, SYNC_FORGET_HIERARCHY_REQUEST, SYNC_HANDLER_MD,
+ true /*retryOnError*/);
}
/*
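
To make the tag-to-path mapping concrete, here is roughly how a
checkpoint-time caller resolves a tag (a sketch; INIT_FILETAG and mdfilepath
are the names from this patch, and the example path is just the usual segment
naming for a relation in database 16384 with relfilenode 16385):

    FileTag     tag;
    char       *path;

    INIT_FILETAG(tag, rnode.node, MAIN_FORKNUM, 2);
    path = mdfilepath(&tag);    /* e.g. "base/16384/16385.2", palloc'd */
    pfree(path);                /* the caller is responsible for freeing */
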
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4ab33..39f4fed25eb 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -20,6 +20,7 @@
#include "commands/tablespace.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -59,12 +60,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -82,15 +79,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -751,52 +744,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
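
With the per-smgr checkpoint hooks gone, checkpoint-time sync work is driven
entirely by sync.c. Schematically, and assuming the CreateCheckPoint wiring
that appears elsewhere in the patch, the flow is:

    SyncPreCheckpoint();    /* bump cycle ctr, before REDO point is fixed */
    /* ... BufferSync() writes dirty buffers, queueing sync requests ... */
    ProcessSyncRequests();  /* called via CheckPointBuffers(); fsyncs files */
    /* ... the checkpoint record is written and flushed ... */
    SyncPostCheckpoint();   /* now it is safe to unlink deferred files */
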
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 00000000000..cfc60cadb4c
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 00000000000..5f7db69e8a0
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,638 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ FileTag ftag; /* hash table key (must be first!) */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* true if request was canceled "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ FileTag ftag; /* tag for relation file to delete */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * This struct of function pointers defines the API between sync.c and
+ * any module that wants files synced or unlinked at checkpoint time
+ * (currently only md.c). The callbacks resolve a FileTag to a pathname,
+ * and decide whether a tag matches a "forget hierarchy" predicate such
+ * as "all files belonging to this database".
+ */
+typedef struct f_sync
+{
+ char* (*sync_filepath) (const FileTag *ftag);
+ bool (*sync_filetagmatches) (const FileTag *ftag,
+ const FileTag *predicate, SyncRequestType type);
+} f_sync;
+
+static const f_sync syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_filepath = mdfilepath,
+ .sync_filetagmatches = mdfiletagmatches
+ }
+};
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * SyncPreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+SyncPreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * SyncPostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+SyncPostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char *path;
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+
+ if (unlink(path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+ pfree(path);
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in ProcessFsyncRequests, we don't want to stop absorbing fsync
+ * requests for along time when there are many deletions to be done. We
+ * can safelycall AbsorbFsyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessSyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If the entry is new then don't process it this time; it was
+ * entered during this sync cycle and should be processed on the
+ * next pass. Note "continue" bypasses the hash-remove call at the
+ * bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If fsync is off then we don't have to bother opening the file at
+ * all. (We delay checking until this point so that changing fsync
+ * on the fly behaves sensibly.) We still remove the entry, though,
+ * or the hash table would grow without bound while fsync is off.
+ */
+ if (!enableFsync)
+ {
+ if (hash_search(pendingOps, &entry->ftag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ continue;
+ }
+
+ /*
+ * If in checkpointer, we want to absorb pending requests
+ * every so often to prevent overflow of the fsync request
+ * queue. It is unspecified whether newly-added entries will
+ * be visited by hash_seq_search, but we don't care since we
+ * don't need to process them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments
+ * that have been deleted (unlinked) by the time we get to
+ * them. Rather than just hoping an ENOENT (or EACCES on
+ * Windows) error can be ignored, what we do on error is
+ * absorb pending requests and then retry. Since mdunlink()
+ * queues a "cancel" message before actually unlinking, the
+ * fsync request is guaranteed to be marked canceled after the
+ * absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts
+ * deletions.
+ *
+ * If the entry was canceled after the absorb above, or during an
+ * absorb inside the loop below, we exit the retry loop and remove
+ * the entry right after it. The loop can also exit at "break".
+ */
+ for (failures = 0; !(entry->canceled); failures++)
+ {
+ char *path;
+ File fd;
+ int save_errno;
+
+ path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+ fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (fd >= 0 &&
+ FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ FileClose(fd);
+ pfree(path);
+ break; /* out of retry loop */
+ }
+
+ /* Done with the file descriptor; close it, preserving errno */
+ save_errno = errno;
+ if (fd >= 0)
+ FileClose(fd);
+ errno = save_errno;
+
+ /*
+ * It is possible that the relation has been dropped or
+ * truncated since the fsync request was entered.
+ * Therefore, allow ENOENT, but only if we didn't fail
+ * already on this file. This applies both for
+ * PathNameOpenFile() and for FileSync(), since fd.c might have
+ * closed the file behind our back.
+ *
+ * XXX is there any point in allowing more than one retry?
+ * Don't see one at the moment, but easy to change the
+ * test here if so.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ pfree(path);
+
+ /*
+ * Absorb incoming requests and check to see if a cancel
+ * arrived for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->ftag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(const FileTag *ftag, SyncRequestType type, SyncRequestHandler handler)
+{
+ Assert(pendingOps);
+
+ if (type == SYNC_FORGET_REQUEST)
+ {
+ PendingFsyncEntry *entry;
+
+ /* Cancel any previously entered request */
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *)ftag,
+ HASH_FIND,
+ NULL);
+ if (entry != NULL)
+ entry->canceled = true;
+ }
+ else if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ {
+ /* Remove any pending requests for the entire database */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Remove fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (syncsw[entry->handler].sync_filetagmatches(&(entry->ftag),
+ ftag /* predicate */, type))
+ {
+ entry->canceled = true;
+ }
+ }
+
+ /* Remove unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (syncsw[entry->handler].sync_filetagmatches(&(entry->ftag),
+ ftag /* predicate */, type))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == SYNC_UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->ftag = *ftag;
+ entry->handler = handler;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ ftag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->handler = handler;
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * RegisterSyncRequest()
+ *
+ * Register the sync request locally, or forward it to the checkpointer.
+ * The caller can choose to retry indefinitely, or to give up and return
+ * immediately on error. Currently we sleep for 10 ms between retries.
+ */
+bool
+RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError)
+{
+ bool ret;
+
+ if (pendingOps != NULL)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(ftag, type, handler);
+ return true;
+ }
+ else
+ {
+ while (1)
+ {
+ /*
+ * Notify the checkpointer about it. If we fail to queue the request
+ * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
+ * error would leave the no-longer-used file still present on disk,
+ * which would be bad, so I'm inclined to assume that the checkpointer
+ * will always empty the queue soon.
+ */
+ ret = ForwardSyncRequest(ftag, type, handler);
+
+ /*
+ * If we were successful in queueing the request, or we failed and
+ * were instructed not to retry on error, break.
+ */
+ if (ret || !retryOnError)
+ break;
+
+ pg_usleep(10000L);
+ }
+
+ return ret;
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
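
The syncsw table is the extension point for the UNDO and SLRU proposals: a
new kind of block storage needs only to supply these two callbacks and a
SyncRequestHandler value to index them. Purely hypothetically (no undo_*
functions exist in this patch), an extended table might look like:

    static const f_sync syncsw[] = {
        /* magnetic disk */
        {
            .sync_filepath = mdfilepath,
            .sync_filetagmatches = mdfiletagmatches
        },
        /* hypothetical undo log storage, indexed by SYNC_HANDLER_UNDO */
        {
            .sync_filepath = undofilepath,
            .sync_filetagmatches = undofiletagmatches
        }
    };
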
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 752010ed276..1c2a99c9c8c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -51,6 +51,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -555,6 +556,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3cb..40b05d46617 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb5..40f46b871d7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessSyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
new file mode 100644
index 00000000000..fc13e34a6f6
--- /dev/null
+++ b/src/include/storage/md.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * md.h
+ * magnetic disk storage manager public interface declarations.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/md.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MD_H
+#define MD_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
+
+/* md storage manager functionality */
+extern void mdinit(void);
+extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
+extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber nblocks);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+
+/* md sync callbacks */
+extern char *mdfilepath(const FileTag *ftag);
+extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
+ SyncRequestType type);
+
+#endif /* MD_H */
diff --git a/src/include/storage/segment.h b/src/include/storage/segment.h
new file mode 100644
index 00000000000..c7af9451687
--- /dev/null
+++ b/src/include/storage/segment.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * segment.h
+ * POSTGRES disk segment definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/segment.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SEGMENT_H
+#define SEGMENT_H
+
+
+/*
+ * Segment Number:
+ *
+ * Each relation and its forks are divided into segments. This typedef
+ * formalizes the notion of a segment number.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
+
+#endif /* SEGMENT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 4d60b28dac3..fd5025f8531 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,7 +17,6 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
-
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -105,43 +104,6 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
-/* internals: move me elsewhere -- ay 7/94 */
-
-/* in md.c */
-extern void mdinit(void);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
- BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 00000000000..11a0f01d424
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,86 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/segment.h"
+
+/*
+ * Caller specified type of sync request.
+ *
+ * SYNC_REQUESTs are issued to sync a particular file whose path is determined
+ * by calling back the handler. A SYNC_FORGET_REQUEST instructs the sync
+ * mechanism to cancel a previously submitted sync request.
+ *
+ * SYNC_FORGET_HIERARCHY_REQUEST is a special kind of forget request that
+ * scans all pending sync requests and cancels every entry that matches.
+ * Matching is delegated to the handler's callback, since the key is opaque
+ * to the sync mechanism. Handling these requests is a tad slow because we
+ * have to search all the requests linearly, but the operations that use
+ * them, such as dropping a database, are pretty heavyweight anyhow, so
+ * we'll live with it.
+ *
+ * SYNC_UNLINK_REQUEST is a request to delete the file after the next
+ * checkpoint. The path is determined by calling back the handler.
+ */
+typedef enum syncrequesttype
+{
+ SYNC_REQUEST,
+ SYNC_FORGET_REQUEST,
+ SYNC_FORGET_HIERARCHY_REQUEST,
+ SYNC_UNLINK_REQUEST
+} SyncRequestType;
+
+/*
+ * Identifies the handler for the sync callbacks.
+ *
+ * These values index entries in the callback function table; for clarity,
+ * the first value is explicitly set to 0. See sync.c for more information.
+ */
+typedef enum syncrequesthandler
+{
+ SYNC_HANDLER_MD = 0 /* md smgr */
+} SyncRequestHandler;
+
+/*
+ * Augmenting a relfilenode with the fork and segment number provides all
+ * the information needed to locate the particular segment of interest.
+ */
+typedef struct filetag
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTag;
+
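+/* Helper macro to initialize all fields of a FileTag. */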
+#define INIT_FILETAG(a,xx_rnode,xx_forknum,xx_segno) \
+( \
+ (a).rnode = (xx_rnode), \
+ (a).forknum = (xx_forknum), \
+ (a).segno = (xx_segno) \
+)
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void SyncPreCheckpoint(void);
+extern void SyncPostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void EnableSyncRequestForwarding(void);
+extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError);
+
+#endif /* SYNC_H */
--
2.21.0
0002-fixup-v13.patch
From e24543ea7c4b2ac70c3d4b6df23e69f9cebf92f7 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Mon, 25 Mar 2019 20:06:49 +1300
Subject: [PATCH 2/2] fixup
---
src/backend/storage/smgr/md.c | 122 +++++++++++++++++++-------------
src/backend/storage/sync/sync.c | 48 ++++---------
src/include/storage/md.h | 3 +-
3 files changed, 88 insertions(+), 85 deletions(-)
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 8cc9fb16148..eced84f8413 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -874,54 +874,6 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
-/*
- * mdfilepath()
- *
- * Return the filename for the specified segment of the relation. The
- * returned string is palloc'd.
- */
-char *
-mdfilepath(const FileTag *ftag)
-{
- char *path,
- *fullpath;
-
- /*
- * We can safely pass InvalidBackendId as we never expect to sync
- * any segments for temporary relations.
- */
- path = GetRelationPath(ftag->rnode.dbNode, ftag->rnode.spcNode,
- ftag->rnode.relNode, InvalidBackendId, ftag->forknum);
-
- if (ftag->segno > 0 && ftag->segno != InvalidSegmentNumber)
- {
- fullpath = psprintf("%s.%u", path, ftag->segno);
- pfree(path);
- }
- else
- fullpath = path;
-
- return fullpath;
-}
-
-/*
- * mdfiletagmatches()
- *
- * Returns true if the predicate tag matches with the file tag.
- */
-bool
-mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
- SyncRequestType type)
-{
- /* Today, we only do matching for hierarchy (forget database) requests */
- Assert(type == SYNC_FORGET_HIERARCHY_REQUEST);
-
- if (type == SYNC_FORGET_HIERARCHY_REQUEST)
- return ftag->rnode.dbNode == predicate->rnode.dbNode;
-
- return false;
-}
-
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1291,3 +1243,77 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* note that this calculation will ignore any partial block at EOF */
return (BlockNumber) (len / BLCKSZ);
}
+
+/*
+ * Sync a file to disk, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, and otherwise an errno value.
+ */
+int
+mdsyncfiletag(const FileTag *ftag, char *path)
+{
+ /* Open the relation via the smgr cache, for performance. */
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ MdfdVec *v;
+ char *p;
+
+ /* Provide the path for informational messages. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+ /* Try to open the requested segment. */
+ v = _mdfd_openseg(reln, ftag->forknum, ftag->segno, 0);
+ if (v == NULL)
+ return ENOENT;
+
+ /* Try to fsync the file. */
+ if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+ return errno;
+
+ return 0;
+}
+
+/*
+ * Unlink a file, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, and otherwise an errno value.
+ */
+int
+mdunlinkfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ char *p;
+
+ /* Compute the path. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+ /* Try to unlink the file. */
+ if (unlink(path) < 0)
+ return errno;
+
+ return 0;
+}
+
+/*
+ * Return true if the predicate tag matches the file tag, for the
+ * purpose of filtering out requests.
+ */
+bool
+mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
+ SyncRequestType type)
+{
+ Assert(type == SYNC_FORGET_HIERARCHY_REQUEST);
+
+ /* We only support dropping all requests for a given database. */
+ if (type == SYNC_FORGET_HIERARCHY_REQUEST)
+ return ftag->rnode.dbNode == predicate->rnode.dbNode;
+
+ return false;
+}
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 5f7db69e8a0..c82d2db91fb 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -82,18 +82,12 @@ static CycleCtr checkpoint_cycle_ctr = 0;
#define UNLINKS_PER_ABSORB 10
/*
- * This struct of function pointers defines the API between smgr.c and
- * any individual storage manager module. Note that smgr subfunctions are
- * generally expected to report problems via elog(ERROR). An exception is
- * that smgr_unlink should use elog(WARNING), rather than erroring out,
- * because we normally unlink relations during post-commit/abort cleanup,
- * and so it's too late to raise an error. Also, various conditions that
- * would normally be errors should be allowed during bootstrap and/or WAL
- * recovery --- see comments in md.c for details.
+ * Function pointers for handling sync and unlink requests.
*/
typedef struct f_sync
{
- char* (*sync_filepath) (const FileTag *ftag);
+ int (*sync_syncfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
bool (*sync_filetagmatches) (const FileTag *ftag,
const FileTag *predicate, SyncRequestType type);
} f_sync;
@@ -101,7 +95,8 @@ typedef struct f_sync
static const f_sync syncsw[] = {
/* magnetic disk */
{
- .sync_filepath = mdfilepath,
+ .sync_syncfiletag = mdsyncfiletag,
+ .sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
}
};
@@ -186,7 +181,7 @@ SyncPostCheckpoint(void)
while (pendingUnlinks != NIL)
{
PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
+ char path[MAXPGPATH];
/*
* New entries are appended to the end, so if the entry is new we've
@@ -201,9 +196,9 @@ SyncPostCheckpoint(void)
break;
/* Unlink the file */
- path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
+ errno = syncsw[entry->handler].sync_unlinkfiletag(&entry->ftag, path);
- if (unlink(path) < 0)
+ if (errno != 0)
{
/*
* There's a race condition, when the database is dropped at the
@@ -217,7 +212,6 @@ SyncPostCheckpoint(void)
(errcode_for_file_access(),
errmsg("could not remove file \"%s\": %m", path)));
}
- pfree(path);
/* And remove the list entry */
pendingUnlinks = list_delete_first(pendingUnlinks);
@@ -374,15 +368,11 @@ ProcessSyncRequests(void)
*/
for (failures = 0; !(entry->canceled); failures++)
{
- char *path;
- File fd;
-
- path = syncsw[entry->handler].sync_filepath(&(entry->ftag));
- fd = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+ char path[MAXPGPATH];
INSTR_TIME_SET_CURRENT(sync_start);
- if (fd >= 0 &&
- FileSync(fd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
+ errno = syncsw[entry->handler].sync_syncfiletag(&entry->ftag, path);
+ if (errno == 0)
{
/* Success; update statistics about sync timing */
INSTR_TIME_SET_CURRENT(sync_end);
@@ -400,26 +390,14 @@ ProcessSyncRequests(void)
path,
(double) elapsed / 1000);
- FileClose(fd);
- pfree(path);
break; /* out of retry loop */
}
- /* Done with the file descriptor, close it */
- if (fd >= 0)
- FileClose(fd);
-
/*
* It is possible that the relation has been dropped or
* truncated since the fsync request was entered.
* Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * smgrgetseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
+ * already on this file.
*/
if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
ereport(data_sync_elevel(ERROR),
@@ -432,8 +410,6 @@ ProcessSyncRequests(void)
errmsg("could not fsync file \"%s\" but retrying: %m",
path)));
- pfree(path);
-
/*
* Absorb incoming requests and check to see if a cancel
* arrived for this relation fork.
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
index fc13e34a6f6..b162f10edd9 100644
--- a/src/include/storage/md.h
+++ b/src/include/storage/md.h
@@ -44,7 +44,8 @@ extern void ForgetDatabaseSyncRequests(Oid dbid);
extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
/* md sync callbacks */
-extern char *mdfilepath(const FileTag *ftag);
+extern int mdsyncfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path);
extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *predicate,
SyncRequestType type);
--
2.21.0
On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I've been trying to decide if that is a problem. Maybe there is a
performance angle, and I'm also wondering if it might increase the
risk of missing a write-back error. Of course we'll find a proper
solution to that problem (perhaps fd-passing or sync-on-close[1]), but
I don't want to commit anything in the name of refactoring that might
make matters worse incidentally. Or perhaps those worries are bogus,
since the checkpointer calls smgrcloseall() at the end anyway.

On balance, I'm inclined to err on the side of caution and try to keep
things a bit closer to the way they are for now.

Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002
has my proposed changes to switch to callbacks that actually perform
the sync and unlink operations given a file tag, and do so via the
SMGR fd cache, rather than exposing the path to sync.c. This moves us
back towards the way I had it in an earlier version of the patch, but
instead of using smgrsyncimmed() as I had it, it goes via Shawn's new
sync handler function lookup table, allowing for non-smgr components
to use this machinery in future (as requested by Andres).
Strong +1. Not only might closing and reopening the files have
performance and reliability implications, but a future smgr might talk
to the network, having no local file to sync.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Mar 26, 2019 at 09:22:56AM -0400, Robert Haas wrote:
On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
I've been trying to decide if that is a problem. Maybe there is a
performance angle, and I'm also wondering if it might increase the
risk of missing a write-back error. Of course we'll find a proper
solution to that problem (perhaps fd-passing or sync-on-close[1]), but
I don't want to commit anything in the name of refactoring that might
make matters worse incidentally. Or perhaps those worries are bogus,
since the checkpointer calls smgrcloseall() at the end anyway.

On balance, I'm inclined to err on the side of caution and try to keep
things a bit closer to the way they are for now.

Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002
has my proposed changes to switch to callbacks that actually perform
the sync and unlink operations given a file tag, and do so via the
SMGR fd cache, rather than exposing the path to sync.c. This moves us
back towards the way I had it in an earlier version of the patch, but
instead of using smgrsyncimmed() as I had it, it goes via Shawn's new
sync handler function lookup table, allowing for non-smgr components
to use this machinery in future (as requested by Andres).

Strong +1. Not only might closing and reopening the files have
performance and reliability implications, but a future smgr might talk
to the network, having no local file to sync.
Makes sense for now. When we re-visit the fd-passing or sync-on-close
implementations, we can adapt the changes relatively easily given the
rest of the framework is staying intact. I am hoping these patches do
not delay the last fsync-gate issue discussion further.
--
Shawn Debnath
Amazon Web Services (AWS)
On Wed, Mar 27, 2019 at 5:48 AM Shawn Debnath <sdn@amazon.com> wrote:
On Tue, Mar 26, 2019 at 09:22:56AM -0400, Robert Haas wrote:
On Tue, Mar 26, 2019 at 12:20 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Here's a fixup patch. 0001 is the same as Shawn's v12 patch, and 0002
has my proposed changes to switch to callbacks that actually perform
the sync and unlink operations given a file tag, and do so via the
SMGR fd cache, rather than exposing the path to sync.c. This moves us
back towards the way I had it in an earlier version of the patch, but
instead of using smgrsyncimmed() as I had it, it goes via Shawn's new
sync handler function lookup table, allowing for non-smgr components
to use this machinery in future (as requested by Andres).

Strong +1. Not only might closing and reopening the files have
performance and reliability implications, but a future smgr might talk
to the network, having no local file to sync.

Makes sense for now. When we re-visit the fd-passing or sync-on-close
implementations, we can adapt the changes relatively easily given the
rest of the framework is staying intact. I am hoping these patches do
not delay the last fsync-gate issue discussion further.
I found a few more things that I thought needed adjustment:
* Packing handler and request type into a uint8 is cute but a waste of
time if we're just going to put it in a struct next to a member that
requires word-alignment. So I changed it to a pair of plain old int16
members. The ftag member starts at offset 4 either way, on my system (a
small sketch after this list illustrates the point).
* I didn't really like the use of the word HIERARCHY in the name of
the request type, and changed it to SYNC_FILTER_REQUEST. That word
came up because we were implementing a kind of hierarchy, where if you
drop a database you want to forget things for all segments inside all
relations inside that database, but the whole point of this new API is
that it doesn't understand that, it calls a filter function to decide
which requests to keep. So I preferred "filter" as a name for the
type of request (a second sketch after this list shows how it is driven).
* I simplified the "matches" callback interface.
* Various typos and comment clean-up.
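To make the alignment point concrete, here is a tiny standalone sketch
of the layout question. The struct shapes below are stand-ins rather
than the real headers (so the exact field types are assumptions), but on
a typical ABI both lines print 4, because FileTag itself requires 4-byte
alignment and padding eats whatever the uint8 packing saved:

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the real PostgreSQL types (assumed shapes). */
typedef struct { uint32_t spcNode, dbNode, relNode; } RelFileNode;
typedef struct { RelFileNode rnode; int32_t forknum; uint32_t segno; } FileTag;

/* handler and type squeezed into a single byte */
typedef struct { uint8_t packed; FileTag ftag; } PackedRequest;

/* a pair of plain old int16 members */
typedef struct { int16_t handler; int16_t type; FileTag ftag; } PlainRequest;

int
main(void)
{
    /* Both print 4 here: padding eats whatever the packing "saved". */
    printf("packed: ftag at offset %zu\n", offsetof(PackedRequest, ftag));
    printf("plain:  ftag at offset %zu\n", offsetof(PlainRequest, ftag));
    return 0;
}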
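And here is a minimal sketch of how the renamed SYNC_FILTER_REQUEST is
meant to be driven from md.c, assuming the v14 interfaces
(RegisterSyncRequest, INIT_FILETAG, InvalidSegmentNumber). Only the
database OID in the tag is meaningful; sync.c later calls the handler's
filter callback to decide which pending requests to cancel:

void
ForgetDatabaseSyncRequests(Oid dbid)
{
    FileTag     tag;
    RelFileNode rnode;

    /* Only dbNode matters; the filter callback ignores the other fields. */
    rnode.dbNode = dbid;
    rnode.spcNode = 0;
    rnode.relNode = 0;

    INIT_FILETAG(tag, rnode, InvalidForkNumber, InvalidSegmentNumber);

    /* Retry on a full queue: there is no way to redo this request later. */
    RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, SYNC_HANDLER_MD,
                        true /* retryOnError */ );
}

The matching itself then happens in the handler's callback
(mdfiletagmatches() in the fixup patch), which compares only rnode.dbNode.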
I'm going to do some more testing and tidying tomorrow (for example I
think the segment.h header is silly and I'd like to remove that), and
commit this.
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Refactor-the-fsync-queue-for-future-SMGR-impleme-v14.patch
From db6da006ef987f40622ae9f047e05a56135b3310 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH] Refactor the fsync queue for future SMGR implementations.
Previously, md.c and checkpointer.c were tightly integrated so that
regular backends could write data and the checkpointer process could
eventually call fsync. Introduce a system of callbacks and file
tags, so that other subsystems can hand off fsync work in the same
way. For now only md.c uses the new interface, but other users are
proposed.
Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/twophase.c | 1 +
src/backend/access/transam/xact.c | 1 +
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 7 +-
src/backend/postmaster/checkpointer.c | 38 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 915 ++++----------------------
src/backend/storage/smgr/smgr.c | 55 +-
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 610 +++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/md.h | 51 ++
src/include/storage/segment.h | 28 +
src/include/storage/smgr.h | 38 --
src/include/storage/sync.h | 71 ++
src/tools/pgindent/typedefs.list | 4 +
19 files changed, 972 insertions(+), 897 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/md.h
create mode 100644 src/include/storage/segment.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 11992f7447d..ecc01f741d4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -98,6 +98,7 @@
#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e9ed92b70bb..72d54396dfa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -50,6 +50,7 @@
#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c6ca96079c1..6a3c80aed46 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6981,7 +6982,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8566,7 +8567,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8856,7 +8857,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ SyncPostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 35cad0b6294..9707afabd98 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -54,6 +54,7 @@
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
@@ -941,11 +942,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2150,7 +2151,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c2411081a5e..68017353955 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -108,10 +108,9 @@
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ int16 handler; /* which sync functions to call */
+ int16 type; /* request type */
+ FileTag ftag; /* opaque identifier of the file to sync */
} CheckpointerRequest;
typedef struct
@@ -349,7 +348,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -684,7 +683,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -709,7 +708,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -1084,7 +1083,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1113,10 +1112,11 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler)
{
CheckpointerRequest *request;
- bool too_full;
+ bool too_full;
if (!IsUnderPostmaster)
return false; /* probably shouldn't even get here */
@@ -1151,9 +1151,9 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->handler = handler;
+ request->type = type;
+ request->ftag = *ftag;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1190,7 +1190,7 @@ CompactCheckpointerRequestQueue(void)
struct CheckpointerSlotMapping
{
CheckpointerRequest request;
- int slot;
+ int slot;
};
int n,
@@ -1284,8 +1284,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to the sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1293,7 +1293,7 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
@@ -1335,7 +1335,9 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(&(request->ftag),
+ request->type,
+ request->handler);
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6ea..8376cdfca20 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385fe..887023fc8a5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6ed68185edb..53e5ccce15c 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,18 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/relfilenode.h"
+#include "storage/segment.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,53 +87,30 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+
/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
+ * Segment handling behaviors
*/
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
-
-
-/*** behavior for mdopen & _mdfd_getseg ***/
/* ereport if segment not present */
#define EXTENSION_FAIL (1 << 0)
/* return NULL if segment not present */
@@ -179,26 +129,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +138,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +252,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -413,6 +275,11 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
+ /* First, forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, 0 /* first seg */ );
+
+ /* Next unlink the file */
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
@@ -442,7 +309,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
}
/*
@@ -459,6 +326,13 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /*
+ * Forget any pending sync requests for the segment before we
+ * unlink it
+ */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1003,388 +877,6 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
-}
-
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1397,19 +889,16 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTag tag;
+
+ INIT_FILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD,
+ false /* retryOnError */ ))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1423,254 +912,54 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
+ * register_unlink_segment() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
*/
static void
-register_unlink(RelFileNodeBackend rnode)
+register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
+ FileTag tag;
+
+ INIT_FILETAG(tag, rnode.node, forknum, segno);
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, SYNC_HANDLER_MD,
+ true /* retryOnError */ );
}
/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
+ * register_forget_request() -- forget any fsyncs for a relation fork's segment
*/
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+static void
+register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ SegmentNumber segno)
{
- Assert(pendingOpsTable);
+ FileTag tag;
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
+ INIT_FILETAG(tag, rnode.node, forknum, segno);
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
-
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, SYNC_HANDLER_MD,
+ true /* retryOnError */ );
}
/*
* ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
+ FileTag tag;
RelFileNode rnode;
rnode.dbNode = dbid;
rnode.spcNode = 0;
rnode.relNode = 0;
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ INIT_FILETAG(tag, rnode, InvalidForkNumber, InvalidSegmentNumber);
+
+ RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, SYNC_HANDLER_MD,
+ true /* retryOnError */ );
}
/*
@@ -1951,3 +1240,77 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* note that this calculation will ignore any partial block at EOF */
return (BlockNumber) (len / BLCKSZ);
}
+
+/*
+ * Sync a file to disk, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, and otherwise an errno value.
+ */
+int
+mdsyncfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ MdfdVec *v;
+ char *p;
+
+ /* Provide the path for informational messages. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+	/* Try to open the requested segment. */
+ v = _mdfd_openseg(reln, ftag->forknum, ftag->segno, 0);
+ if (v == NULL)
+ return ENOENT;
+
+ /* Try to fsync the file. */
+ if (FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+ return errno;
+
+ return 0;
+}
+
+/*
+ * Unlink a file, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, and otherwise an errno value.
+ */
+int
+mdunlinkfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ char *p;
+
+ /* Compute the path. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+ /* Try to unlink the file. */
+ if (unlink(path) < 0)
+ return errno;
+
+ return 0;
+}
+
+/*
+ * Check if a given candidate request matches a given tag, when processing
+ * a SYNC_FILTER_REQUEST request. This will be called for all pending
+ * requests to find out whether to forget them.
+ */
+bool
+mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
+{
+ /*
+ * For now we only use filter requests as a way to drop all scheduled
+ * callbacks relating to a given database, when dropping the database.
+ * We'll return true for all candidates that have the same database OID as
+ * the ftag from the SYNC_FILTER_REQUEST request, so they're forgotten.
+ */
+ return ftag->rnode.dbNode == candidate->rnode.dbNode;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f6de9df9e61..8191118b619 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -60,12 +61,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -83,15 +80,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -705,52 +698,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 00000000000..cfc60cadb4c
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 00000000000..0d58fa4a4cf
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,610 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ FileTag ftag; /* hash table key (must be first!) */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* canceled is true if we canceled "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ FileTag ftag; /* tag for relation file to delete */
+ SyncRequestHandler handler; /* request resolution handler */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * Function pointers for handling sync and unlink requests.
+ */
+typedef struct SyncOps
+{
+ int (*sync_syncfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ bool (*sync_filetagmatches) (const FileTag *ftag,
+ const FileTag *candidate);
+} SyncOps;
+
+static const SyncOps syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_syncfiletag = mdsyncfiletag,
+ .sync_unlinkfiletag = mdunlinkfiletag,
+ .sync_filetagmatches = mdfiletagmatches
+ }
+};
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * SyncPreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+SyncPreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * SyncPostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+SyncPostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char path[MAXPGPATH];
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ errno = syncsw[entry->handler].sync_unlinkfiletag(&entry->ftag, path);
+
+ if (errno != 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+	 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+	 * requests for a long time when there are many deletions to be done.
+	 * We can safely call AbsorbSyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessFsyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If fsync is off then we don't have to bother opening the file at
+ * all. (We delay checking until this point so that changing fsync on
+ * the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ continue;
+
+ /*
+ * If the entry is new then don't process it this time; it might
+ * contain multiple fsync-request bits, but they are all new. Note
+ * "continue" bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests every so
+ * often to prevent overflow of the fsync request queue. It is
+ * unspecified whether newly-added entries will be visited by
+ * hash_seq_search, but we don't care since we don't need to process
+ * them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments that have
+ * been deleted (unlinked) by the time we get to them. Rather than
+ * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
+ * what we do on error is absorb pending requests and then retry.
+ * Since mdunlink() queues a "cancel" message before actually
+ * unlinking, the fsync request is guaranteed to be marked canceled
+ * after the absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts deletions.
+ *
+	 * If the entry was canceled after the absorb above, or within the
+	 * absorb inside the loop, we exit the loop and delete the entry right
+	 * after.  The loop can also exit at the "break" below, on success.
+ */
+ for (failures = 0; !(entry->canceled); failures++)
+ {
+ char path[MAXPGPATH];
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ errno = syncsw[entry->handler].sync_syncfiletag(&entry->ftag, path);
+ if (errno == 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ break; /* out of retry loop */
+ }
+
+ /*
+ * It is possible that the relation has been dropped or truncated
+ * since the fsync request was entered. Therefore, allow ENOENT,
+ * but only if we didn't fail already on this file.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ /*
+ * Absorb incoming requests and check to see if a cancel arrived
+ * for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->ftag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(const FileTag *ftag, SyncRequestType type, SyncRequestHandler handler)
+{
+ Assert(pendingOps);
+
+ if (type == SYNC_FORGET_REQUEST)
+ {
+ PendingFsyncEntry *entry;
+
+ /* Cancel previously entered request */
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *) ftag,
+ HASH_FIND,
+ NULL);
+ if (entry != NULL)
+ entry->canceled = true;
+ }
+ else if (type == SYNC_FILTER_REQUEST)
+ {
+ /* Cancel any pending requests that match the given tag. */
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Cancel fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (syncsw[entry->handler].sync_filetagmatches(ftag,
+ &entry->ftag))
+ entry->canceled = true;
+ }
+
+ /* Remove unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (syncsw[entry->handler].sync_filetagmatches(ftag,
+ &(entry->ftag)))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == SYNC_UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->ftag = *ftag;
+ entry->handler = handler;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ ftag,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->handler = handler;
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * RegisterSyncRequest()
+ *
+ * Register the sync request locally, or forward it to the checkpointer.
+ * The caller can choose to retry indefinitely on error, or to return
+ * immediately; when retrying, we currently wait 10 ms between attempts.
+ */
+bool
+RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError)
+{
+ bool ret;
+
+ if (pendingOps != NULL)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(ftag, type, handler);
+ return true;
+ }
+ else
+ {
+ while (1)
+ {
+ /*
+ * Notify the checkpointer about it. If we fail to queue the
+ * cancel message, we have to sleep and try again ... ugly, but
+ * hopefully won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with
+ * an error would leave the no-longer-used file still present on
+ * disk, which would be bad, so I'm inclined to assume that the
+ * checkpointer will always empty the queue soon.
+ */
+ ret = ForwardSyncRequest(ftag, type, handler);
+
+ /*
+			 * If we are successful in queueing the request, or we failed
+			 * and were instructed not to retry on error, break.
+			 */
+			if (ret || !retryOnError)
+ break;
+
+ pg_usleep(10000L);
+ }
+
+ return ret;
+ }
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 752010ed276..1c2a99c9c8c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -51,6 +51,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -555,6 +556,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3cb..40b05d46617 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb5..40f46b871d7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see mdsync).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
new file mode 100644
index 00000000000..a6758a10dcb
--- /dev/null
+++ b/src/include/storage/md.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * md.h
+ * magnetic disk storage manager public interface declarations.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/md.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MD_H
+#define MD_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
+
+/* md storage manager functionality */
+extern void mdinit(void);
+extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
+extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber nblocks);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+
+/* md sync callbacks */
+extern int mdsyncfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
+
+#endif /* MD_H */
diff --git a/src/include/storage/segment.h b/src/include/storage/segment.h
new file mode 100644
index 00000000000..c7af9451687
--- /dev/null
+++ b/src/include/storage/segment.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * segment.h
+ * POSTGRES disk segment definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/segment.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SEGMENT_H
+#define SEGMENT_H
+
+
+/*
+ * Segment Number:
+ *
+ * Each relation and its forks are divided into segments. This
+ * definition formalizes the definition of the segment number.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
+
+#endif /* SEGMENT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e982738789..770193e285e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,7 +18,6 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
-
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -106,43 +105,6 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
-/* internals: move me elsewhere -- ay 7/94 */
-
-/* in md.c */
-extern void mdinit(void);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
- BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 00000000000..75fbc7eac8e
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,71 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/segment.h"
+
+/*
+ * Type of sync request. These can be used to manage the set of pending
+ * requests to call the handler's sync or unlink functions at the next
+ * checkpoint.
+ */
+typedef enum SyncRequestType
+{
+ SYNC_REQUEST, /* schedule a call of handler's sync fn */
+ SYNC_FORGET_REQUEST, /* forget a scheduled call for given tag */
+ SYNC_FILTER_REQUEST, /* forget calls satisfying handler's match fn */
+ SYNC_UNLINK_REQUEST /* schedule a call of handler's unlink fn */
+} SyncRequestType;
+
+/*
+ * Which set of functions to use to handle a given request. See the function
+ * table in sync.c.
+ */
+typedef enum SyncRequestHandler
+{
+ SYNC_HANDLER_MD = 0 /* md smgr */
+} SyncRequestHandler;
+
+/*
+ * Augmenting a relfilenode with the fork and segment number provides all
+ * the information to locate the particular segment of interest for a relation.
+ */
+typedef struct FileTag
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ SegmentNumber segno;
+} FileTag;
+
+#define INIT_FILETAG(a,xx_rnode,xx_forknum,xx_segno) \
+( \
+ (a).rnode = (xx_rnode), \
+ (a).forknum = (xx_forknum), \
+ (a).segno = (xx_segno) \
+)
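+
+/*
+ * Illustrative usage (a hypothetical caller; register_dirty_segment in
+ * md.c does the equivalent):
+ *
+ *		FileTag tag;
+ *
+ *		INIT_FILETAG(tag, rnode, MAIN_FORKNUM, segno);
+ *		RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD, false);
+ */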
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void SyncPreCheckpoint(void);
+extern void SyncPostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void EnableSyncRequestForwarding(void);
+extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError);
+
+#endif /* SYNC_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f31929664ac..b631b42f4ca 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -651,6 +651,7 @@ File
FileFdwExecutionState
FileFdwPlanState
FileNameMap
+FileTag
FindSplitData
FixedParallelExecutorState
FixedParallelState
@@ -2276,7 +2277,10 @@ Subscription
SubscriptionInfo
SubscriptionRelState
Syn
+SyncOps
SyncRepConfigData
+SyncRequestHandler
+SyncRequestType
SysScanDesc
SyscacheCallbackFunction
SystemRowsSamplerData
--
2.21.0
On Tue, Apr 2, 2019 at 11:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I'm going to do some more testing and tidying tomorrow (for example I
think the segment.h header is silly and I'd like to remove that), and
commit this.
As a sanity check on the programming interface this thing gives you, I
tried teaching the SLRUs to use the fsync queue. I finished up making
a few small improvements, but the main thing I learned is that
"handler" needs to be part of the hash table key. I suppose the
discriminator could even be inside FileTag itself, but I chose to keep
it separate and introduce a new struct to hold handler enum + FileTag
in the hash table.
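To make that concrete, the key ends up shaped roughly like this (an
illustrative sketch only -- "PendingFsyncKey" is a name invented for
this email; see the attached patch for the real definition):

    /* hash table key: handler + tag together identify a pending request */
    typedef struct PendingFsyncKey
    {
        SyncRequestHandler handler;    /* which sync callbacks to use */
        FileTag            ftag;       /* file identifier, opaque here */
    } PendingFsyncKey;

That way two handlers can queue requests whose tags happen to compare
equal without colliding in the pending-ops table.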
The 0001 patch is what I'm going to commit soon. I don't know why
Shawn measured a performance change -- a bug in an earlier version? --
I can't see it, but I'm planning to look into that a bit more first.
I've attached the 0002 SLRU patch for interest, but I'm not planning
to commit that one.
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Refactor-the-fsync-queue-for-wider-use-v15.patchapplication/octet-stream; name=0001-Refactor-the-fsync-queue-for-wider-use-v15.patchDownload
From 463ebbec28b7965869e01f8fd3c4d5a61b5feb31 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH 1/2] Refactor the fsync queue for wider use.
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off work in the same way.
For now only md.c uses the new interface, but other users are being
proposed. Since there may be use cases that are not strictly SMGR
implementations, use a new function table rather than the traditional
SMGR one.
Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key. This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse, high
segment numbers).
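As a rough before/after sketch of the hash table key (simplified, with
field types as in the earlier WIP revision; the real definitions are in
the patch):

    /* before: per-relation entry, segments tracked in bitmapsets */
    typedef struct
    {
        RelFileNode rnode;                      /* hash key */
        Bitmapset  *requests[MAX_FORKNUM + 1];  /* segnos to fsync */
    } PendingOperationEntry;

    /* after: the segment number is part of the key itself */
    typedef struct FileTag
    {
        RelFileNode   rnode;
        ForkNumber    forknum;
        SegmentNumber segno;
    } FileTag;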
Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/twophase.c | 1 +
src/backend/access/transam/xact.c | 1 +
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 7 +-
src/backend/postmaster/checkpointer.c | 47 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 915 ++++----------------------
src/backend/storage/smgr/smgr.c | 55 +-
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 612 +++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/md.h | 51 ++
src/include/storage/smgr.h | 38 --
src/include/storage/sync.h | 64 ++
src/tools/pgindent/typedefs.list | 7 +-
18 files changed, 943 insertions(+), 905 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/md.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 11992f7447d..ecc01f741d4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -98,6 +98,7 @@
#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index e9ed92b70bb..72d54396dfa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -50,6 +50,7 @@
#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c6ca96079c1..6a3c80aed46 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6981,7 +6982,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8566,7 +8567,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8856,7 +8857,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ SyncPostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 35cad0b6294..9707afabd98 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -54,6 +54,7 @@
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
@@ -941,11 +942,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2150,7 +2151,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c2411081a5e..7e74a802289 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -108,10 +108,9 @@
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ int16 handler; /* which sync functions to call */
+ int16 type; /* request type */
+ FileTag ftag; /* opaque identifier of the file to sync */
} CheckpointerRequest;
typedef struct
@@ -349,7 +348,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -684,7 +683,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -709,7 +708,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -1084,7 +1083,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1093,15 +1092,6 @@ RequestCheckpoint(int flags)
* is dirty and must be fsync'd before next checkpoint. We also use this
* opportunity to count such writes for statistical purposes.
*
- * This functionality is only supported for regular (not backend-local)
- * relations, so the rnode argument is intentionally RelFileNode not
- * RelFileNodeBackend.
- *
- * segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
- *
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
* will have to eliminate dups internally anyway. However, if we discover
@@ -1113,7 +1103,8 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler)
{
CheckpointerRequest *request;
bool too_full;
@@ -1122,7 +1113,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
return false; /* probably shouldn't even get here */
if (AmCheckpointerProcess())
- elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+ elog(ERROR, "ForwardSyncRequest must not be called in checkpointer");
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
@@ -1143,7 +1134,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
+ if (!AmBackgroundWriterProcess() && type == SYNC_REQUEST)
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1151,9 +1142,9 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->handler = handler;
+ request->type = type;
+ request->ftag = *ftag;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1284,8 +1275,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1293,7 +1284,7 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
@@ -1335,7 +1326,9 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(&request->ftag,
+ request->type,
+ request->handler);
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6ea..8376cdfca20 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385fe..887023fc8a5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6ed68185edb..20378c46b65 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,17 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,50 +86,34 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno);
+static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
-/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
- */
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
+/*
+ * Populate a file tag describing an md.c segment file.  The tag is used as
+ * a hash table key in sync.c and compared as raw bytes, so zero the whole
+ * struct first to avoid uninitialized padding.
+ */
+#define INIT_MDFILETAG(a,xx_rnode,xx_forknum,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).rnode = (xx_rnode), \
+ (a).forknum = (xx_forknum), \
+ (a).segno = (xx_segno) \
+)
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -179,26 +135,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +144,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +258,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -413,6 +281,11 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
+ /* First, forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, 0 /* first seg */ );
+
+ /* Next unlink the file */
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
@@ -442,7 +315,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
}
/*
@@ -459,6 +332,13 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /*
+ * Forget any pending sync requests for the segment before we
+ * unlink
+ */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1003,388 +883,6 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
-}
-
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1397,19 +895,16 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTag tag;
+
+ INIT_MDFILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, SYNC_HANDLER_MD,
+ false /* retryOnError */ ))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1423,254 +918,54 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/*
* register_unlink() -- Schedule a file to be deleted after next checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
*/
static void
-register_unlink(RelFileNodeBackend rnode)
+register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno)
{
+ FileTag tag;
+
+ INIT_MDFILETAG(tag, rnode.node, forknum, segno);
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, SYNC_HANDLER_MD,
+ true /* retryOnError */ );
}
/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
+ * register_forget_request() -- forget any fsyncs for a relation fork's segment
*/
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+static void
+register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno)
{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
+ FileTag tag;
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
+ INIT_MDFILETAG(tag, rnode.node, forknum, segno);
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, SYNC_HANDLER_MD,
+ true /* retryOnError */ );
}
/*
* ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
+ FileTag tag;
RelFileNode rnode;
rnode.dbNode = dbid;
rnode.spcNode = 0;
rnode.relNode = 0;
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
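+
+ /*
+ * Build a tag in which only dbNode is significant; mdfiletagmatches()
+ * inspects only the database OID when handling SYNC_FILTER_REQUEST.
+ */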
+ INIT_MDFILETAG(tag, rnode, InvalidForkNumber, InvalidBlockNumber);
+
+ RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, SYNC_HANDLER_MD,
+ true /* retryOnError */ );
}
/*
@@ -1951,3 +1246,75 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* note that this calculation will ignore any partial block at EOF */
return (BlockNumber) (len / BLCKSZ);
}
+
+/*
+ * Sync a file to disk, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, -1 on failure, with errno set.
+ */
+int
+mdsyncfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ MdfdVec *v;
+ char *p;
+
+ /* Provide the path for informational messages. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+ /* Try to open the requested segment. */
+ v = _mdfd_getseg(reln, ftag->forknum,
+ ftag->segno * (BlockNumber) RELSEG_SIZE, false,
+ EXTENSION_RETURN_NULL | EXTENSION_DONT_CHECK_SIZE);
+ if (v == NULL)
+ {
+ errno = ENOENT;
+ return -1;
+ }
+
+ /* Try to fsync the file. */
+ return FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC);
+}
+
+/*
+ * Unlink a file, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, -1 on failure, with errno set.
+ */
+int
+mdunlinkfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ char *p;
+
+ /* Compute the path. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+ /* Try to unlink the file. */
+ return unlink(path);
+}
+
+/*
+ * Check if a given candidate request matches a given tag, when processing
+ * a SYNC_FILTER_REQUEST request. This will be called for all pending
+ * requests to find out whether to forget them.
+ */
+bool
+mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
+{
+ /*
+ * For now we only use filter requests as a way to drop all scheduled
+ * callbacks relating to a given database, when dropping the database.
+ * We'll return true for all candidates that have the same database OID as
+ * the ftag from the SYNC_FILTER_REQUEST request, so they're forgotten.
+ */
+ return ftag->rnode.dbNode == candidate->rnode.dbNode;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f6de9df9e61..8191118b619 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -60,12 +61,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -83,15 +80,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -705,52 +698,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 00000000000..cfc60cadb4c
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 00000000000..7bcfa6b6c90
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,612 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ SyncRequestHandler handler; /* which SyncOps functions to use? */
+ FileTag ftag; /* opaque file tag, meaningful to handler */
+} PendingOpKey;
+
+typedef struct
+{
+ PendingOpKey key;
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+ bool canceled; /* true if we canceled the request "recently" */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ PendingOpKey key;
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * Function pointers for handling sync and unlink requests.
+ */
+typedef struct SyncOps
+{
+ int (*sync_syncfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ bool (*sync_filetagmatches) (const FileTag *ftag,
+ const FileTag *candidate);
+} SyncOps;
+
+static const SyncOps syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_syncfiletag = mdsyncfiletag,
+ .sync_unlinkfiletag = mdunlinkfiletag,
+ .sync_filetagmatches = mdfiletagmatches
+ }
+};
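+
+/*
+ * To handle a new kind of file (for example, the SLRU files targeted by the
+ * follow-on patch in this series), add a value to SyncRequestHandler in
+ * sync.h and a matching row of callbacks here; sync.c itself needs no
+ * knowledge of how the underlying files are named or opened.
+ */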
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(PendingOpKey);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * SyncPreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+SyncPreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * SyncPostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+SyncPostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char path[MAXPGPATH];
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ if (syncsw[entry->key.handler].sync_unlinkfiletag(&entry->key.ftag,
+ path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+ * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+ * requests for a long time when there are many deletions to be done.
+ * We can safely call AbsorbSyncRequests() at this point in the loop
+ * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps table.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+ * previous ProcessSyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+ * If fsync is off then we don't have to bother opening the file at
+ * all. (We delay checking until this point so that changing fsync on
+ * the fly behaves sensibly.)
+ */
+ if (!enableFsync)
+ continue;
+
+ /*
+ * If the entry is new then don't process it this time; it was
+ * inserted after this sync cycle began.  Note "continue" bypasses
+ * the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests every so
+ * often to prevent overflow of the fsync request queue. It is
+ * unspecified whether newly-added entries will be visited by
+ * hash_seq_search, but we don't care since we don't need to process
+ * them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments that have
+ * been deleted (unlinked) by the time we get to them. Rather than
+ * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
+ * what we do on error is absorb pending requests and then retry.
+ * Since mdunlink() queues a "cancel" message before actually
+ * unlinking, the fsync request is guaranteed to be marked canceled
+ * after the absorb if it really was this case. DROP DATABASE likewise
+ * has to tell us to forget fsync requests before it starts deletions.
+ */
+ for (failures = 0; !entry->canceled; failures++)
+ {
+ char path[MAXPGPATH];
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (syncsw[entry->key.handler].sync_syncfiletag(&entry->key.ftag,
+ path) == 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ break; /* out of retry loop */
+ }
+
+ /*
+ * It is possible that the relation has been dropped or truncated
+ * since the fsync request was entered. Therefore, allow ENOENT,
+ * but only if we didn't fail already on this file.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ /*
+ * Absorb incoming requests and check to see if a cancel arrived
+ * for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->key, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(const FileTag *ftag, SyncRequestType type, SyncRequestHandler handler)
+{
+ Assert(pendingOps);
+
+ if (type == SYNC_FORGET_REQUEST)
+ {
+ PendingOpKey key = {0};
+ PendingFsyncEntry *entry;
+
+ /*
+ * Cancel the previously entered request.  We mark it canceled rather
+ * than removing it, because ProcessSyncRequests could have an active
+ * hash table scan in progress.
+ */
+ key.handler = handler;
+ key.ftag = *ftag;
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *) &key,
+ HASH_FIND,
+ NULL);
+ if (entry != NULL)
+ entry->canceled = true;
+ }
+ else if (type == SYNC_FILTER_REQUEST)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Cancel matching fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (entry->key.handler == handler &&
+ syncsw[entry->key.handler].sync_filetagmatches(ftag,
+ &entry->key.ftag))
+ entry->canceled = true;
+ }
+
+ /* Remove matching unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (entry->key.handler == handler &&
+ syncsw[entry->key.handler].sync_filetagmatches(ftag,
+ &entry->key.ftag))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == SYNC_UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->key.handler = handler;
+ entry->key.ftag = *ftag;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingOpKey key = {0};
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ key.handler = handler;
+ key.ftag = *ftag;
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ &key,
+ HASH_ENTER,
+ &found);
+ /* if new entry, initialize it */
+ if (!found)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * Register the sync request locally, or forward it to the checkpointer.
+ *
+ * If retryOnError is true, we'll keep trying if there is no space in the
+ * queue. Return true if we succeeded, or false if there wasn't space.
+ */
+bool
+RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError)
+{
+ bool ret;
+
+ if (pendingOps != NULL)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(ftag, type, handler);
+ return true;
+ }
+
+ for (;;)
+ {
+ /*
+ * Notify the checkpointer about it. If we fail to queue the
+ * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
+ * error would leave the no-longer-used file still present on disk,
+ * which would be bad, so I'm inclined to assume that the checkpointer
+ * will always empty the queue soon.
+ */
+ ret = ForwardSyncRequest(ftag, type, handler);
+
+ /*
+ * If we are successful in queueing the request, or we failed and were
+ * instructed not to retry on error, break.
+ */
+ if (ret || !retryOnError)
+ break;
+
+ pg_usleep(10000L); /* 10 msec seems a good number */
+ }
+
+ return ret;
+}
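+
+/*
+ * Callers whose requests must not be dropped (such as md.c's forget and
+ * unlink requests) pass retryOnError = true; register_dirty_segment()
+ * passes false and copes with a full queue itself.
+ */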
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 752010ed276..1c2a99c9c8c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -51,6 +51,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -555,6 +556,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3cb..40b05d46617 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb5..40f46b871d7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessSyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
new file mode 100644
index 00000000000..a6758a10dcb
--- /dev/null
+++ b/src/include/storage/md.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * md.h
+ * magnetic disk storage manager public interface declarations.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/md.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MD_H
+#define MD_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
+
+/* md storage manager functionality */
+extern void mdinit(void);
+extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
+extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber nblocks);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+
+/* md sync callbacks */
+extern int mdsyncfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
+
+#endif /* MD_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e982738789..770193e285e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,7 +18,6 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
-
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -106,43 +105,6 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
-/* internals: move me elsewhere -- ay 7/94 */
-
-/* in md.c */
-extern void mdinit(void);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
- BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 00000000000..6fb066a53b7
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,64 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Type of sync request. These are used to manage the set of pending
+ * requests to call the handler's sync or unlink functions at the next
+ * checkpoint.
+ */
+typedef enum SyncRequestType
+{
+ SYNC_REQUEST, /* schedule a call of sync function */
+ SYNC_UNLINK_REQUEST, /* schedule a call of unlink function */
+ SYNC_FORGET_REQUEST, /* forget all calls for a tag */
+ SYNC_FILTER_REQUEST /* forget all calls satisfying match fn */
+} SyncRequestType;
+
+/*
+ * Which set of functions to use to handle a given request. See the function
+ * table in sync.c.
+ */
+typedef enum SyncRequestHandler
+{
+ SYNC_HANDLER_MD = 0 /* md smgr */
+} SyncRequestHandler;
+
+/*
+ * A tag identifying a file. Currently it has the members required for md.c's
+ * usage, but sync.c has no knowledge of the internal structure, and it is
+ * liable to change as required by future handlers.
+ */
+typedef struct FileTag
+{
+ RelFileNode rnode;
+ ForkNumber forknum;
+ BlockNumber segno;
+} FileTag;
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void SyncPreCheckpoint(void);
+extern void SyncPostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler);
+extern void EnableSyncRequestForwarding(void);
+extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ SyncRequestHandler handler, bool retryOnError);
+
+#endif /* SYNC_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f31929664ac..6e0459a98d9 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -651,6 +651,7 @@ File
FileFdwExecutionState
FileFdwPlanState
FileNameMap
+FileTag
FindSplitData
FixedParallelExecutorState
FixedParallelState
@@ -1700,7 +1701,8 @@ PathKeysComparison
PathTarget
Pattern_Prefix_Status
Pattern_Type
-PendingOperationEntry
+PendingFsyncEntry
+PendingOpKey
PendingRelDelete
PendingUnlinkEntry
PendingWriteback
@@ -2276,7 +2278,10 @@ Subscription
SubscriptionInfo
SubscriptionRelState
Syn
+SyncOps
SyncRepConfigData
+SyncRequestHandler
+SyncRequestType
SysScanDesc
SyscacheCallbackFunction
SystemRowsSamplerData
--
2.21.0
0002-Use-the-fsync-queue-for-SLRU-files-v15.patch
From 7b7087a34d69fe10718dd24310c8626c39a68d92 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 3 Apr 2019 22:15:19 +1300
Subject: [PATCH 2/2] Use the fsync queue for SLRU files.
Previously, we called fsync() after writing out each SLRU page. Use the
same mechanism for deferring and handing off fsync work to the
checkpointer that md.c uses.
This is a proof-of-concept only for now.
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 104 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 22 +++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 173 insertions(+), 44 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3bd55fbdd33..a3d3f9a304e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pgstat.h"
#include "pg_trace.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -699,7 +700,8 @@ CLOGShmemInit(void)
{
ClogCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(ClogCtl, "clog", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- CLogControlLock, "pg_xact", LWTRANCHE_CLOG_BUFFERS);
+ CLogControlLock, "pg_xact", LWTRANCHE_CLOG_BUFFERS,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1041,3 +1043,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(ClogCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 8162f884bd1..e35480d89ec 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -494,7 +494,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0,
CommitTsControlLock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFERS);
+ LWTRANCHE_COMMITTS_BUFFERS,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1022,3 +1023,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 763b9997071..bf2e9886032 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1829,11 +1829,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"multixact_offset", NUM_MXACTOFFSET_BUFFERS, 0,
MultiXactOffsetControlLock, "pg_multixact/offsets",
- LWTRANCHE_MXACTOFFSET_BUFFERS);
+ LWTRANCHE_MXACTOFFSET_BUFFERS,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"multixact_member", NUM_MXACTMEMBER_BUFFERS, 0,
MultiXactMemberControlLock, "pg_multixact/members",
- LWTRANCHE_MXACTMEMBER_BUFFERS);
+ LWTRANCHE_MXACTMEMBER_BUFFERS,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3392,3 +3394,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 974d42fc866..467e265ff8c 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -81,6 +81,20 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_segno) \
+( \
+ (a).rnode.spcNode = 0, \
+ (a).rnode.dbNode = 0, \
+ (a).rnode.relNode = 0, \
+ (a).forknum = 0, \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -163,7 +177,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -247,7 +262,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -862,23 +877,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd))
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, ctl->sync_handler, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) < 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1140,21 +1163,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]))
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]))
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1270,6 +1283,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1313,6 +1327,18 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, ctl->sync_handler,
+ true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1411,3 +1437,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index e667fd02385..aa71a6ddbc6 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransControlLock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFERS, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 5a7ee0de4cf..9c68358da6a 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -479,9 +479,8 @@ AsyncShmemInit(void)
*/
AsyncCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(AsyncCtl, "async", NUM_ASYNC_BUFFERS, 0,
- AsyncCtlLock, "pg_notify", LWTRANCHE_ASYNC_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- AsyncCtl->do_fsync = false;
+ AsyncCtlLock, "pg_notify", LWTRANCHE_ASYNC_BUFFERS,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 4e4d04bae37..ab41194930f 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -824,9 +824,7 @@ OldSerXidInit(void)
OldSerXidSlruCtl->PagePrecedes = OldSerXidPagePrecedesLogically;
SimpleLruInit(OldSerXidSlruCtl, "oldserxid",
NUM_OLDSERXID_BUFFERS, 0, OldSerXidLock, "pg_serial",
- LWTRANCHE_OLDSERXID_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- OldSerXidSlruCtl->do_fsync = false;
+ LWTRANCHE_OLDSERXID_BUFFERS, SYNC_HANDLER_NONE);
/*
* Create or attach to the OldSerXidControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 7bcfa6b6c90..80838ef3dfa 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -20,6 +20,9 @@
#include "miscadmin.h"
#include "pgstat.h"
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "commands/tablespace.h"
@@ -96,13 +99,30 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/* These indexes must correspond to the values of the SyncRequestHandler enum. */
static const SyncOps syncsw[] = {
/* magnetic disk */
{
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 57ef9fe858e..f55391c73e3 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 123c91128b8..1f32196873d 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 83ae5b6b795..05dcbc8ae35 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index b6e66f56a0a..deccde4cc44 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -115,10 +116,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -139,7 +140,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -155,6 +157,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index 6fb066a53b7..53d39926339 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -35,7 +35,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.21.0
On Thu, Apr 04, 2019 at 01:08:31AM +1300, Thomas Munro wrote:
On Tue, Apr 2, 2019 at 11:09 PM Thomas Munro <thomas.munro@gmail.com> wrote:
I'm going to do some more testing and tidying tomorrow (for example I
think the segment.h header is silly and I'd like to remove that), and
commit this.
Given the dislike in the thread for introducing the concept of segments
at any layer higher than the storage manager itself, I thought it would
be better to leave the existing header files alone and introduce a new
one to separate the concept. I am fine either way we go.
As a sanity check on the programming interface this thing gives you, I
tried teaching the SLRUs to use the fsync queue. I finished up making
a few small improvements, but the main thing I learned is that
"handler" needs to be part of the hash table key. I suppose the
discriminator could even be inside FileTag itself, but I chose to keep
it separate and introduce a new struct to hold handler enum + FileTag
in the hash table.
I think this is fine, but can you elaborate a bit more on why we need to
include the handler in the key for the hash table? We are de-duping
relfilenodes here and these should never collide with files from another
component. The component separation would be encoded in the RelFileNode
that the target smgr, or SLRU in this case, would be able to decipher.
Do you foresee, or know of, use cases where FileTag alone will result in
conflicts on the same file but from different handlers?
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_segno) \
+( \
+ (a).rnode.spcNode = 0, \
+ (a).rnode.dbNode = 0, \
+ (a).rnode.relNode = 0, \
+ (a).forknum = 0, \
+ (a).segno = (xx_segno) \
+)
Based on the definition of INIT_SLRUFILETAG in your patch, it seems you
are trying to only use the segno to identify the file. Not sure why we
can't use other unused fields in FileTag to identify the component? You
could for example in the current SLRU implementation have the handler
set to SYNC_HANDLER_SLRU when invoking RegisterSyncRequest() and use
relNode to distinguish between each SLRU component in a wrapper function
and call slrusyncfiletag() with the right SlruCtl.
For the SLRU to buffer cache work, I was planning on using the relNode
field to identify which specific component this tag belongs to. dbNode
would be pointing to the type of smgr (as discussed earlier in the
thread and still TBD).
I would prefer not to expand the hash key unnecessarily, and given this
isn't persisted, we can expand the key in the future if needed. Keeps
the code simpler for now.
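To make that alternative concrete, here is a rough sketch of the
single-handler approach (entirely hypothetical: SYNC_HANDLER_SLRU, the
SlruComponent ids and slrucommonsyncfiletag() exist in neither patch, and
it assumes the SlruCtl objects are made visible outside their home files,
which they currently are not):

/* Hypothetical component ids, stashed in rnode.relNode by the caller. */
typedef enum SlruComponent
{
	SLRU_CLOG,
	SLRU_COMMIT_TS,
	SLRU_MULTIXACT_OFFSET,
	SLRU_MULTIXACT_MEMBER
} SlruComponent;

int
slrucommonsyncfiletag(const FileTag *ftag, char *path)
{
	SlruCtl		ctl;

	/* Recover the right SlruCtl from the component id in the tag. */
	switch ((SlruComponent) ftag->rnode.relNode)
	{
		case SLRU_CLOG:
			ctl = ClogCtl;
			break;
		case SLRU_COMMIT_TS:
			ctl = CommitTsCtl;
			break;
		case SLRU_MULTIXACT_OFFSET:
			ctl = MultiXactOffsetCtl;
			break;
		case SLRU_MULTIXACT_MEMBER:
			ctl = MultiXactMemberCtl;
			break;
		default:
			elog(ERROR, "unrecognized SLRU component: %u",
				 ftag->rnode.relNode);
	}
	return slrusyncfiletag(ctl, ftag, path);
}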
The 0001 patch is what I'm going to commit soon. I don't know why
Shawn measured a performance change -- a bug in an earlier version? --
I can't see it, but I'm planning to look into that a bit more first.
I've attached the 0002 SLRU patch for interest, but I'm not planning
to commit that one.
Question is which block of code did you measure? I can redo the
instrumentation on the latest patch and re-validate and share the
results. I previously measured the average time it took mdsync() and
ProcessSyncRequests() in the patch to complete under similar workloads.
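For reference, that kind of measurement can be taken with the same
instr_time macros mdsync() already uses internally; this is only a
sketch, and TimeProcessSyncRequests is a made-up name:

/* Time one whole sync pass; a sketch of the instrumentation discussed. */
static void
TimeProcessSyncRequests(void)
{
	instr_time	start,
				duration;

	INSTR_TIME_SET_CURRENT(start);
	ProcessSyncRequests();
	INSTR_TIME_SET_CURRENT(duration);
	INSTR_TIME_SUBTRACT(duration, start);

	elog(LOG, "ProcessSyncRequests took %.3f ms",
		 INSTR_TIME_GET_MILLISEC(duration));
}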
I found a few more things that I thought needed adjustment:
* Packing handler and request type into a uint8 is cute but a waste of
time if we're just going to put it in a struct next to a member that
requires word-alignment. So I changed it to a pair of plain old int16
members. The ftag member starts at offset 4 either way, on my system.
Good catch! For posterity, using a packed attribute here would be bad as
well, since it would leave RelFileNode's spcNode in FileTag misaligned in
subsequent entries, given our use of the requests as an array.
This is potentially unsafe on platforms other than x86. Re-arranging the
fields would lead to the same result. Thanks for catching and fixing
this!
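The offset claim is easy to verify with a small standalone program (the
struct definitions below are simplified stand-ins for the real ones):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef struct RelFileNode
{
	uint32_t	spcNode;
	uint32_t	dbNode;
	uint32_t	relNode;
} RelFileNode;

typedef struct FileTag
{
	RelFileNode rnode;			/* 4-byte-aligned members */
	int32_t		forknum;
	uint32_t	segno;
} FileTag;

/* Two plain int16 members ahead of the tag... */
typedef struct RequestA
{
	int16_t		type;
	int16_t		handler;
	FileTag		ftag;
} RequestA;

/* ...or one packed uint8: alignment padding still pushes ftag to 4. */
typedef struct RequestB
{
	uint8_t		type_and_handler;
	FileTag		ftag;
} RequestB;

int
main(void)
{
	/* Both print 4 on typical ABIs. */
	printf("ftag offset: %zu vs %zu\n",
		   offsetof(RequestA, ftag), offsetof(RequestB, ftag));
	return 0;
}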
* I didn't really like the use of the word HIERARCHY in the name of
the request type, and changed it to SYNC_FILTER_REQUEST. That word
came up because we were implementing a kind of hierarchy, where if you
drop a database you want to forget things for all segments inside all
relations inside that database, but the whole point of this new API is
that it doesn't understand that; it calls a filter function to decide
which requests to keep. So I preferred "filter" as a name for the
type of request.
Yeah, I never much liked the word hierarchy, but couldn't think of a
better one either. Filter is perfect.
--
Shawn Debnath
Amazon Web Services (AWS)
On Thu, Apr 4, 2019 at 10:44 AM Shawn Debnath <sdn@amazon.com> wrote:
On Thu, Apr 04, 2019 at 01:08:31AM +1300, Thomas Munro wrote:
As a sanity check on the programming interface this thing gives you, I
tried teaching the SLRUs to use the fsync queue. I finished up making
a few small improvements, but the main thing I learned is that
"handler" needs to be part of the hash table key. I suppose the
discriminator could even be inside FileTag itself, but I chose to keep
it separate and introduce a new struct to hold handler enum + FileTag
in the hash table.

I think this is fine, but can you elaborate a bit more on why we need to
include the handler in the key for the hash table? We are de-duping
relfilenodes here and these should never collide with files from another
component. The component separation would be encoded in the RelFileNode
that the target smgr, or SLRU in this case, would be able to decipher.
Do you foresee, or know of, use cases where FileTag alone will result in
conflicts on the same file but from different handlers?
Well, it depends how we think of the FileTag namespace. I didn't worry
before because md.c and my development version of undofile.c generate
FileTag objects that never collide, because undofile.c sets the dbNode
to invalid (and in earlier versions had a special magic DB OID), while
md.c always uses valid values. But when I did the SLRU experiment
with that 0002 patch, I figured it was reasonable to want to set only
the segment number part of the FileTag. Then sync requests for
pg_xact, pg_multixact/offsets and pg_multixact/members got merged.
Oops.
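To see the merging concretely: with only segno set, tags from two
different SLRUs come out byte-for-byte identical, so a hash table keyed
on the bare FileTag has to treat them as one request. A standalone
sketch with simplified stand-in types:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct RelFileNode
{
	uint32_t	spcNode;
	uint32_t	dbNode;
	uint32_t	relNode;
} RelFileNode;

typedef struct FileTag
{
	RelFileNode rnode;
	int32_t		forknum;
	uint32_t	segno;
} FileTag;

int
main(void)
{
	/* Segment 7 registered by pg_xact and by pg_multixact/offsets... */
	FileTag		clog_tag = {{0, 0, 0}, 0, 7};
	FileTag		mxofs_tag = {{0, 0, 0}, 0, 7};

	/* ...compare equal, so the two requests would be merged. */
	printf("identical: %s\n",
		   memcmp(&clog_tag, &mxofs_tag, sizeof(FileTag)) == 0 ?
		   "yes" : "no");
	return 0;
}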
As you say, we could decide to assign bogus reserved dbNode values to
keep them the keyspaces apart and not allow future modules to deviate
from using a RelFileNode in the tag, but we already more-or-less
decided in another thread that we don't want to do that for buffer
tags. People seemed keen to see another discriminator for smgr ID,
rather than requiring eg md.c and undo.c to use non-colliding
RelFileNode values via a fake dbNode scheme. This problem is
essentially the same, except that we decided we want the set of sync
handlers to be potentially larger than the set of smgr handlers. For
example, until we have a patch that moves SLRUs into shared buffers,
SLRUs don't have SMGR IDs, and yet, as the 0002 patch shows, they could
theoretically still benefit from handing off fsync work; there may be
other cases like that. Generally, the question is similar in both
places: (1) do sync handlers each get their own separate namespace of
FileTags? (2) do smgrs get their own namespace of BufferTags? Perhaps
that is an argument for putting the sync handler number *inside* the
FileTag, since we currently intend to do that with smgr IDs in
BufferTag (stealing space from ForkNumber).
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_segno) \
+( \
+ (a).rnode.spcNode = 0, \
+ (a).rnode.dbNode = 0, \
+ (a).rnode.relNode = 0, \
+ (a).forknum = 0, \
+ (a).segno = (xx_segno) \
+)

Based on the definition of INIT_SLRUFILETAG in your patch, it seems you
are trying to only use the segno to identify the file. Not sure why we
can't use other unused fields in FileTag to identify the component? You
could for example in the current SLRU implementation have the handler
set to SYNC_HANDLER_SLRU when invoking RegisterSyncRequest() and use
relNode to distinguish between each SLRU component in a wrapper function
and call slrusyncfiletag() with the right SlruCtl.
Yes, that would work too, though then slru.c would have to know about
clog, multixact etc so it could find their SlruCtlData objects, which
are currently static in eg clog.c, multixact.c etc. That's why I
put a tiny callback into each of those that could call slru.c to do
the work. I'm not really proposing anything here, and I understand
that you might want to refactor this stuff completely; I just wanted a
quick sanity check of *something* else using this interface, other
than md.c and my undo patch, in particular a
thing-that-isn't-connected-to-smgr. If it turns out to be useful for
later work, good, if not, I won't be sad. I also suppressed the urge
to teach it to use fd.c files and keep them open in a small cache --
all topics for future threads.
For the SLRU to buffer cache work, I was planning on using the relNode
field to identify which specific component this tag belongs to. dbNode
would be pointing to the type of smgr (as discussed earlier in the
thread and still TBD).
I look forward to the patches :-) But as mentioned, there was some
serious push-back on hijacking dbNode like that.
The 0001 patch is what I'm going to commit soon. I don't know why
Shawn measured a performance change -- a bug in an earlier version? --
I can't see it, but I'm planning to look into that a bit more first.
I've attached the 0002 SLRU patch for interest, but I'm not planning
to commit that one.

Question is which block of code did you measure? I can redo the
instrumentation on the latest patch and re-validate and share the
results. I previously measured the average time it took mdsync() and
ProcessSyncRequests() in the patch to complete under similar workloads.
Ok. Let me try that again here too.
--
Thomas Munro
https://enterprisedb.com
On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
... Perhaps
that is an argument for putting the sync handler number *inside* the
FileTag, since we currently intend to do that with smgr IDs in
BufferTag (stealing space from ForkNumber).
Here is a version like that. I like it better this way, and the extra
space can be clawed back by using 16 bit types to hold the fork number
and sync handler number.
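For the archives, the reshaped tag might look something like this; the
handler member is confirmed by the INIT_MDFILETAG macro in the attached
v16 patch, but the exact field order and widths here are my assumption:

/* Sketch of a FileTag with the sync handler number embedded. */
typedef struct FileTag
{
	int16		handler;		/* SyncRequestHandler, narrowed to 16 bits */
	int16		forknum;		/* ForkNumber, narrowed to 16 bits */
	RelFileNode rnode;
	uint32		segno;
} FileTag;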
Here again is the straw-man 0002 patch updated to show how it might
look for another potential user.
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Refactor-the-fsync-queue-for-wider-use-v16.patch
From 59d492312e1c680cb3b5d49251dd42ff9365e0f8 Mon Sep 17 00:00:00 2001
From: Shawn Debnath <sdn@amazon.com>
Date: Wed, 27 Feb 2019 18:58:58 +0000
Subject: [PATCH 1/2] Refactor the fsync queue for wider use.
Previously, md.c and checkpointer.c were tightly integrated so that
fsync calls could be handed off and processed in the background.
Introduce a system of callbacks and file tags, so that other modules
can hand off work in the same way.
For now only md.c uses the new interface, but other users are being
proposed. Since there may be use cases that are not strictly SMGR
implementations, use a new function table rather than the traditional
SMGR one.
Instead of using a bitmapset of segment numbers for each RelFileNode
in the checkpointer's hash table, make the segment number part of the
key. This requires sending explicit "forget" requests for every
segment individually when relations are dropped, but suits the file
layout schemes of proposed future users better (ie sparse, high
segment numbers).
Author: Shawn Debnath and Thomas Munro
Reviewed-by: Thomas Munro, Andres Freund
Discussion: https://postgr.es/m/CAEepm=2gTANm=e3ARnJT=n0h8hf88wqmaZxk0JYkxw+b21fNrw@mail.gmail.com
---
src/backend/access/transam/twophase.c | 1 +
src/backend/access/transam/xact.c | 1 +
src/backend/access/transam/xlog.c | 7 +-
src/backend/commands/dbcommands.c | 7 +-
src/backend/postmaster/checkpointer.c | 42 +-
src/backend/storage/Makefile | 2 +-
src/backend/storage/buffer/bufmgr.c | 2 +-
src/backend/storage/smgr/md.c | 913 ++++----------------------
src/backend/storage/smgr/smgr.c | 55 +-
src/backend/storage/sync/Makefile | 17 +
src/backend/storage/sync/sync.c | 598 +++++++++++++++++
src/backend/utils/init/postinit.c | 2 +
src/include/postmaster/bgwriter.h | 8 +-
src/include/storage/fd.h | 12 +
src/include/storage/md.h | 51 ++
src/include/storage/smgr.h | 38 --
src/include/storage/sync.h | 64 ++
src/tools/pgindent/typedefs.list | 6 +-
18 files changed, 921 insertions(+), 905 deletions(-)
create mode 100644 src/backend/storage/sync/Makefile
create mode 100644 src/backend/storage/sync/sync.c
create mode 100644 src/include/storage/md.h
create mode 100644 src/include/storage/sync.h
diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index 11992f7447d..ecc01f741d4 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -98,6 +98,7 @@
#include "replication/walsender.h"
#include "storage/fd.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index b04fdb5d5ed..bd5024ef00a 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -50,6 +50,7 @@
#include "storage/fd.h"
#include "storage/freespace.h"
#include "storage/lmgr.h"
+#include "storage/md.h"
#include "storage/predicate.h"
#include "storage/proc.h"
#include "storage/procarray.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3a3110716d..c00b63c751c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -66,6 +66,7 @@
#include "storage/reinit.h"
#include "storage/smgr.h"
#include "storage/spin.h"
+#include "storage/sync.h"
#include "utils/builtins.h"
#include "utils/guc.h"
#include "utils/memutils.h"
@@ -6981,7 +6982,7 @@ StartupXLOG(void)
if (ArchiveRecoveryRequested && IsUnderPostmaster)
{
PublishStartupProcessInformation();
- SetForwardFsyncRequests();
+ EnableSyncRequestForwarding();
SendPostmasterSignal(PMSIGNAL_RECOVERY_STARTED);
bgwriterLaunched = true;
}
@@ -8566,7 +8567,7 @@ CreateCheckPoint(int flags)
* the REDO pointer. Note that smgr must not do anything that'd have to
* be undone if we decide no checkpoint is needed.
*/
- smgrpreckpt();
+ SyncPreCheckpoint();
/* Begin filling in the checkpoint WAL record */
MemSet(&checkPoint, 0, sizeof(checkPoint));
@@ -8856,7 +8857,7 @@ CreateCheckPoint(int flags)
/*
* Let smgr do post-checkpoint cleanup (eg, deleting old files).
*/
- smgrpostckpt();
+ SyncPostCheckpoint();
/*
* Update the average distance between checkpoints if the prior checkpoint
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 35cad0b6294..9707afabd98 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -54,6 +54,7 @@
#include "storage/fd.h"
#include "storage/lmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/procarray.h"
#include "storage/smgr.h"
#include "utils/acl.h"
@@ -941,11 +942,11 @@ dropdb(const char *dbname, bool missing_ok)
* worse, it will delete files that belong to a newly created database
* with the same OID.
*/
- ForgetDatabaseFsyncRequests(db_id);
+ ForgetDatabaseSyncRequests(db_id);
/*
* Force a checkpoint to make sure the checkpointer has received the
- * message sent by ForgetDatabaseFsyncRequests. On Windows, this also
+ * message sent by ForgetDatabaseSyncRequests. On Windows, this also
* ensures that background procs don't hold any open files, which would
* cause rmdir() to fail.
*/
@@ -2150,7 +2151,7 @@ dbase_redo(XLogReaderState *record)
DropDatabaseBuffers(xlrec->db_id);
/* Also, clean out any fsync requests that might be pending in md.c */
- ForgetDatabaseFsyncRequests(xlrec->db_id);
+ ForgetDatabaseSyncRequests(xlrec->db_id);
/* Clean out the xlog relcache too */
XLogDropDatabase(xlrec->db_id);
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index c2411081a5e..f12dd892d12 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -108,10 +108,8 @@
*/
typedef struct
{
- RelFileNode rnode;
- ForkNumber forknum;
- BlockNumber segno; /* see md.c for special values */
- /* might add a real request-type field later; not needed yet */
+ SyncRequestType type; /* request type */
+ FileTag ftag; /* file identifier */
} CheckpointerRequest;
typedef struct
@@ -349,7 +347,7 @@ CheckpointerMain(void)
/*
* Process any requests or signals received recently.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
if (got_SIGHUP)
{
@@ -684,7 +682,7 @@ CheckpointWriteDelay(int flags, double progress)
UpdateSharedMemoryConfig();
}
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
CheckArchiveTimeout();
@@ -709,7 +707,7 @@ CheckpointWriteDelay(int flags, double progress)
* operations even when we don't sleep, to prevent overflow of the
* fsync request queue.
*/
- AbsorbFsyncRequests();
+ AbsorbSyncRequests();
absorb_counter = WRITES_PER_ABSORB;
}
}
@@ -1084,7 +1082,7 @@ RequestCheckpoint(int flags)
}
/*
- * ForwardFsyncRequest
+ * ForwardSyncRequest
* Forward a file-fsync request from a backend to the checkpointer
*
* Whenever a backend is compelled to write directly to a relation
@@ -1093,15 +1091,6 @@ RequestCheckpoint(int flags)
* is dirty and must be fsync'd before next checkpoint. We also use this
* opportunity to count such writes for statistical purposes.
*
- * This functionality is only supported for regular (not backend-local)
- * relations, so the rnode argument is intentionally RelFileNode not
- * RelFileNodeBackend.
- *
- * segno specifies which segment (not block!) of the relation needs to be
- * fsync'd. (Since the valid range is much less than BlockNumber, we can
- * use high values for special flags; that's all internal to md.c, which
- * see for details.)
- *
* To avoid holding the lock for longer than necessary, we normally write
* to the requests[] queue without checking for duplicates. The checkpointer
* will have to eliminate dups internally anyway. However, if we discover
@@ -1113,7 +1102,7 @@ RequestCheckpoint(int flags)
* let the backend know by returning false.
*/
bool
-ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+ForwardSyncRequest(const FileTag *ftag, SyncRequestType type)
{
CheckpointerRequest *request;
bool too_full;
@@ -1122,7 +1111,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
return false; /* probably shouldn't even get here */
if (AmCheckpointerProcess())
- elog(ERROR, "ForwardFsyncRequest must not be called in checkpointer");
+ elog(ERROR, "ForwardSyncRequest must not be called in checkpointer");
LWLockAcquire(CheckpointerCommLock, LW_EXCLUSIVE);
@@ -1143,7 +1132,7 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
* Count the subset of writes where backends have to do their own
* fsync
*/
- if (!AmBackgroundWriterProcess())
+ if (!AmBackgroundWriterProcess() && type == SYNC_REQUEST)
CheckpointerShmem->num_backend_fsync++;
LWLockRelease(CheckpointerCommLock);
return false;
@@ -1151,9 +1140,8 @@ ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
/* OK, insert request */
request = &CheckpointerShmem->requests[CheckpointerShmem->num_requests++];
- request->rnode = rnode;
- request->forknum = forknum;
- request->segno = segno;
+ request->ftag = *ftag;
+ request->type = type;
/* If queue is more than half full, nudge the checkpointer to empty it */
too_full = (CheckpointerShmem->num_requests >=
@@ -1284,8 +1272,8 @@ CompactCheckpointerRequestQueue(void)
}
/*
- * AbsorbFsyncRequests
- * Retrieve queued fsync requests and pass them to local smgr.
+ * AbsorbSyncRequests
+ * Retrieve queued sync requests and pass them to sync mechanism.
*
* This is exported because it must be called during CreateCheckPoint;
* we have to be sure we have accepted all pending requests just before
@@ -1293,7 +1281,7 @@ CompactCheckpointerRequestQueue(void)
* non-checkpointer processes, do nothing if not checkpointer.
*/
void
-AbsorbFsyncRequests(void)
+AbsorbSyncRequests(void)
{
CheckpointerRequest *requests = NULL;
CheckpointerRequest *request;
@@ -1335,7 +1323,7 @@ AbsorbFsyncRequests(void)
LWLockRelease(CheckpointerCommLock);
for (request = requests; n > 0; request++, n--)
- RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+ RememberSyncRequest(&request->ftag, request->type);
END_CRIT_SECTION();
diff --git a/src/backend/storage/Makefile b/src/backend/storage/Makefile
index bd2d272c6ea..8376cdfca20 100644
--- a/src/backend/storage/Makefile
+++ b/src/backend/storage/Makefile
@@ -8,6 +8,6 @@ subdir = src/backend/storage
top_builddir = ../../..
include $(top_builddir)/src/Makefile.global
-SUBDIRS = buffer file freespace ipc large_object lmgr page smgr
+SUBDIRS = buffer file freespace ipc large_object lmgr page smgr sync
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f385fe..887023fc8a5 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2584,7 +2584,7 @@ CheckPointBuffers(int flags)
BufferSync(flags);
CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- smgrsync();
+ ProcessSyncRequests();
CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 6ed68185edb..6b2e5719a08 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -29,45 +29,17 @@
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "pgstat.h"
-#include "portability/instr_time.h"
#include "postmaster/bgwriter.h"
#include "storage/fd.h"
#include "storage/bufmgr.h"
+#include "storage/md.h"
#include "storage/relfilenode.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "utils/hsearch.h"
#include "utils/memutils.h"
#include "pg_trace.h"
-
-/* intervals for calling AbsorbFsyncRequests in mdsync and mdpostckpt */
-#define FSYNCS_PER_ABSORB 10
-#define UNLINKS_PER_ABSORB 10
-
-/*
- * Special values for the segno arg to RememberFsyncRequest.
- *
- * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
- * fsync request from the queue if an identical, subsequent request is found.
- * See comments there before making changes here.
- */
-#define FORGET_RELATION_FSYNC (InvalidBlockNumber)
-#define FORGET_DATABASE_FSYNC (InvalidBlockNumber-1)
-#define UNLINK_RELATION_REQUEST (InvalidBlockNumber-2)
-
-/*
- * On Windows, we have to interpret EACCES as possibly meaning the same as
- * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
- * that's what you get. Ugh. This code is designed so that we don't
- * actually believe these cases are okay without further evidence (namely,
- * a pending fsync request getting canceled ... see mdsync).
- */
-#ifndef WIN32
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
-#else
-#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
-#endif
-
/*
* The magnetic disk storage manager keeps track of open file
* descriptors in its own descriptor pool. This is done to make it
@@ -114,50 +86,36 @@ typedef struct _MdfdVec
static MemoryContext MdCxt; /* context for all MdfdVec objects */
+/* local routines */
+static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
+ bool isRedo);
+static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
+static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
+static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno);
+static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno);
+static void _fdvec_resize(SMgrRelation reln,
+ ForkNumber forknum,
+ int nseg);
+static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber segno, int oflags);
+static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
+ BlockNumber blkno, bool skipFsync, int behavior);
+static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
+ MdfdVec *seg);
-/*
- * In some contexts (currently, standalone backends and the checkpointer)
- * we keep track of pending fsync operations: we need to remember all relation
- * segments that have been written since the last checkpoint, so that we can
- * fsync them down to disk before completing the next checkpoint. This hash
- * table remembers the pending operations. We use a hash table mostly as
- * a convenient way of merging duplicate requests.
- *
- * We use a similar mechanism to remember no-longer-needed files that can
- * be deleted after the next checkpoint, but we use a linked list instead of
- * a hash table, because we don't expect there to be any duplicate requests.
- *
- * These mechanisms are only used for non-temp relations; we never fsync
- * temp rels, nor do we need to postpone their deletion (see comments in
- * mdunlink).
- *
- * (Regular backends do not track pending operations locally, but forward
- * them to the checkpointer.)
- */
-typedef uint16 CycleCtr; /* can be any convenient integer size */
-
-typedef struct
-{
- RelFileNode rnode; /* hash table key (must be first!) */
- CycleCtr cycle_ctr; /* mdsync_cycle_ctr of oldest request */
- /* requests[f] has bit n set if we need to fsync segment n of fork f */
- Bitmapset *requests[MAX_FORKNUM + 1];
- /* canceled[f] is true if we canceled fsyncs for fork "recently" */
- bool canceled[MAX_FORKNUM + 1];
-} PendingOperationEntry;
-
-typedef struct
-{
- RelFileNode rnode; /* the dead relation to delete */
- CycleCtr cycle_ctr; /* mdckpt_cycle_ctr when request was made */
-} PendingUnlinkEntry;
-
-static HTAB *pendingOpsTable = NULL;
-static List *pendingUnlinks = NIL;
-static MemoryContext pendingOpsCxt; /* context for the above */
-static CycleCtr mdsync_cycle_ctr = 0;
-static CycleCtr mdckpt_cycle_ctr = 0;
+/* Populate a file tag describing a md.c segment file. */
+#define INIT_MDFILETAG(a,xx_rnode,xx_forknum,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = SYNC_HANDLER_MD, \
+ (a).rnode = (xx_rnode), \
+ (a).forknum = (xx_forknum), \
+ (a).segno = (xx_segno) \
+)
/*** behavior for mdopen & _mdfd_getseg ***/
@@ -179,26 +137,6 @@ static CycleCtr mdckpt_cycle_ctr = 0;
#define EXTENSION_DONT_CHECK_SIZE (1 << 4)
-/* local routines */
-static void mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum,
- bool isRedo);
-static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
-static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-static void register_unlink(RelFileNodeBackend rnode);
-static void _fdvec_resize(SMgrRelation reln,
- ForkNumber forknum,
- int nseg);
-static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
-static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
-static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber blkno, bool skipFsync, int behavior);
-static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
- MdfdVec *seg);
-
-
/*
* mdinit() -- Initialize private state for magnetic disk storage manager.
*/
@@ -208,64 +146,6 @@ mdinit(void)
MdCxt = AllocSetContextCreate(TopMemoryContext,
"MdSmgr",
ALLOCSET_DEFAULT_SIZES);
-
- /*
- * Create pending-operations hashtable if we need it. Currently, we need
- * it if we are standalone (not under a postmaster) or if we are a startup
- * or checkpointer auxiliary process.
- */
- if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
- {
- HASHCTL hash_ctl;
-
- /*
- * XXX: The checkpointer needs to add entries to the pending ops table
- * when absorbing fsync requests. That is done within a critical
- * section, which isn't usually allowed, but we make an exception. It
- * means that there's a theoretical possibility that you run out of
- * memory while absorbing fsync requests, which leads to a PANIC.
- * Fortunately the hash table is small so that's unlikely to happen in
- * practice.
- */
- pendingOpsCxt = AllocSetContextCreate(MdCxt,
- "Pending ops context",
- ALLOCSET_DEFAULT_SIZES);
- MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
-
- MemSet(&hash_ctl, 0, sizeof(hash_ctl));
- hash_ctl.keysize = sizeof(RelFileNode);
- hash_ctl.entrysize = sizeof(PendingOperationEntry);
- hash_ctl.hcxt = pendingOpsCxt;
- pendingOpsTable = hash_create("Pending Ops Table",
- 100L,
- &hash_ctl,
- HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
- pendingUnlinks = NIL;
- }
-}
-
-/*
- * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
- * already created the pendingOpsTable during initialization of the startup
- * process. Calling this function drops the local pendingOpsTable so that
- * subsequent requests will be forwarded to checkpointer.
- */
-void
-SetForwardFsyncRequests(void)
-{
- /* Perform any pending fsyncs we may have queued up, then drop table */
- if (pendingOpsTable)
- {
- mdsync();
- hash_destroy(pendingOpsTable);
- }
- pendingOpsTable = NULL;
-
- /*
- * We should not have any pending unlink requests, since mdunlink doesn't
- * queue unlink requests when isRedo.
- */
- Assert(pendingUnlinks == NIL);
}
/*
@@ -380,16 +260,6 @@ mdcreate(SMgrRelation reln, ForkNumber forkNum, bool isRedo)
void
mdunlink(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
{
- /*
- * We have to clean out any pending fsync requests for the doomed
- * relation, else the next mdsync() will fail. There can't be any such
- * requests for a temp relation, though. We can send just one request
- * even when deleting multiple forks, since the fsync queuing code accepts
- * the "InvalidForkNumber = all forks" convention.
- */
- if (!RelFileNodeBackendIsTemp(rnode))
- ForgetRelationFsyncRequests(rnode.node, forkNum);
-
/* Now do the per-fork work */
if (forkNum == InvalidForkNumber)
{
@@ -413,6 +283,11 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
if (isRedo || forkNum != MAIN_FORKNUM || RelFileNodeBackendIsTemp(rnode))
{
+ /* First, forget any pending sync requests for the first segment */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, 0 /* first seg */ );
+
+ /* Next unlink the file */
ret = unlink(path);
if (ret < 0 && errno != ENOENT)
ereport(WARNING,
@@ -442,7 +317,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
errmsg("could not truncate file \"%s\": %m", path)));
/* Register request to unlink first segment later */
- register_unlink(rnode);
+ register_unlink_segment(rnode, forkNum, 0 /* first seg */ );
}
/*
@@ -459,6 +334,13 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
*/
for (segno = 1;; segno++)
{
+ /*
+ * Forget any pending sync requests for the segment before we
+ * unlink
+ */
+ if (!RelFileNodeBackendIsTemp(rnode))
+ register_forget_request(rnode, forkNum, segno);
+
sprintf(segpath, "%s.%u", path, segno);
if (unlink(segpath) < 0)
{
@@ -1003,388 +885,6 @@ mdimmedsync(SMgrRelation reln, ForkNumber forknum)
}
}
-/*
- * mdsync() -- Sync previous writes to stable storage.
- */
-void
-mdsync(void)
-{
- static bool mdsync_in_progress = false;
-
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- int absorb_counter;
-
- /* Statistics on sync times */
- int processed = 0;
- instr_time sync_start,
- sync_end,
- sync_diff;
- uint64 elapsed;
- uint64 longest = 0;
- uint64 total_elapsed = 0;
-
- /*
- * This is only called during checkpoints, and checkpoints should only
- * occur in processes that have created a pendingOpsTable.
- */
- if (!pendingOpsTable)
- elog(ERROR, "cannot sync without a pendingOpsTable");
-
- /*
- * If we are in the checkpointer, the sync had better include all fsync
- * requests that were queued by backends up to this point. The tightest
- * race condition that could occur is that a buffer that must be written
- * and fsync'd for the checkpoint could have been dumped by a backend just
- * before it was visited by BufferSync(). We know the backend will have
- * queued an fsync request before clearing the buffer's dirtybit, so we
- * are safe as long as we do an Absorb after completing BufferSync().
- */
- AbsorbFsyncRequests();
-
- /*
- * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
- * checkpoint), we want to ignore fsync requests that are entered into the
- * hashtable after this point --- they should be processed next time,
- * instead. We use mdsync_cycle_ctr to tell old entries apart from new
- * ones: new ones will have cycle_ctr equal to the incremented value of
- * mdsync_cycle_ctr.
- *
- * In normal circumstances, all entries present in the table at this point
- * will have cycle_ctr exactly equal to the current (about to be old)
- * value of mdsync_cycle_ctr. However, if we fail partway through the
- * fsync'ing loop, then older values of cycle_ctr might remain when we
- * come back here to try again. Repeated checkpoint failures would
- * eventually wrap the counter around to the point where an old entry
- * might appear new, causing us to skip it, possibly allowing a checkpoint
- * to succeed that should not have. To forestall wraparound, any time the
- * previous mdsync() failed to complete, run through the table and
- * forcibly set cycle_ctr = mdsync_cycle_ctr.
- *
- * Think not to merge this loop with the main loop, as the problem is
- * exactly that that loop may fail before having visited all the entries.
- * From a performance point of view it doesn't matter anyway, as this path
- * will never be taken in a system that's functioning normally.
- */
- if (mdsync_in_progress)
- {
- /* prior try failed, so update any stale cycle_ctr values */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- }
- }
-
- /* Advance counter so that new hashtable entries are distinguishable */
- mdsync_cycle_ctr++;
-
- /* Set flag to detect failure if we don't reach the end of the loop */
- mdsync_in_progress = true;
-
- /* Now scan the hashtable for fsync requests to process */
- absorb_counter = FSYNCS_PER_ABSORB;
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- ForkNumber forknum;
-
- /*
- * If the entry is new then don't process it this time; it might
- * contain multiple fsync-request bits, but they are all new. Note
- * "continue" bypasses the hash-remove call at the bottom of the loop.
- */
- if (entry->cycle_ctr == mdsync_cycle_ctr)
- continue;
-
- /* Else assert we haven't missed it */
- Assert((CycleCtr) (entry->cycle_ctr + 1) == mdsync_cycle_ctr);
-
- /*
- * Scan over the forks and segments represented by the entry.
- *
- * The bitmap manipulations are slightly tricky, because we can call
- * AbsorbFsyncRequests() inside the loop and that could result in
- * bms_add_member() modifying and even re-palloc'ing the bitmapsets.
- * So we detach it, but if we fail we'll merge it with any new
- * requests that have arrived in the meantime.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- Bitmapset *requests = entry->requests[forknum];
- int segno;
-
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = false;
-
- segno = -1;
- while ((segno = bms_next_member(requests, segno)) >= 0)
- {
- int failures;
-
- /*
- * If fsync is off then we don't have to bother opening the
- * file at all. (We delay checking until this point so that
- * changing fsync on the fly behaves sensibly.)
- */
- if (!enableFsync)
- continue;
-
- /*
- * If in checkpointer, we want to absorb pending requests
- * every so often to prevent overflow of the fsync request
- * queue. It is unspecified whether newly-added entries will
- * be visited by hash_seq_search, but we don't care since we
- * don't need to process them anyway.
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB;
- }
-
- /*
- * The fsync table could contain requests to fsync segments
- * that have been deleted (unlinked) by the time we get to
- * them. Rather than just hoping an ENOENT (or EACCES on
- * Windows) error can be ignored, what we do on error is
- * absorb pending requests and then retry. Since mdunlink()
- * queues a "cancel" message before actually unlinking, the
- * fsync request is guaranteed to be marked canceled after the
- * absorb if it really was this case. DROP DATABASE likewise
- * has to tell us to forget fsync requests before it starts
- * deletions.
- */
- for (failures = 0;; failures++) /* loop exits at "break" */
- {
- SMgrRelation reln;
- MdfdVec *seg;
- char *path;
- int save_errno;
-
- /*
- * Find or create an smgr hash entry for this relation.
- * This may seem a bit unclean -- md calling smgr? But
- * it's really the best solution. It ensures that the
- * open file reference isn't permanently leaked if we get
- * an error here. (You may say "but an unreferenced
- * SMgrRelation is still a leak!" Not really, because the
- * only case in which a checkpoint is done by a process
- * that isn't about to shut down is in the checkpointer,
- * and it will periodically do smgrcloseall(). This fact
- * justifies our not closing the reln in the success path
- * either, which is a good thing since in non-checkpointer
- * cases we couldn't safely do that.)
- */
- reln = smgropen(entry->rnode, InvalidBackendId);
-
- /* Attempt to open and fsync the target segment */
- seg = _mdfd_getseg(reln, forknum,
- (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
- false,
- EXTENSION_RETURN_NULL
- | EXTENSION_DONT_CHECK_SIZE);
-
- INSTR_TIME_SET_CURRENT(sync_start);
-
- if (seg != NULL &&
- FileSync(seg->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC) >= 0)
- {
- /* Success; update statistics about sync timing */
- INSTR_TIME_SET_CURRENT(sync_end);
- sync_diff = sync_end;
- INSTR_TIME_SUBTRACT(sync_diff, sync_start);
- elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
- if (elapsed > longest)
- longest = elapsed;
- total_elapsed += elapsed;
- processed++;
- requests = bms_del_member(requests, segno);
- if (log_checkpoints)
- elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
- processed,
- FilePathName(seg->mdfd_vfd),
- (double) elapsed / 1000);
-
- break; /* out of retry loop */
- }
-
- /* Compute file name for use in message */
- save_errno = errno;
- path = _mdfd_segpath(reln, forknum, (BlockNumber) segno);
- errno = save_errno;
-
- /*
- * It is possible that the relation has been dropped or
- * truncated since the fsync request was entered.
- * Therefore, allow ENOENT, but only if we didn't fail
- * already on this file. This applies both for
- * _mdfd_getseg() and for FileSync, since fd.c might have
- * closed the file behind our back.
- *
- * XXX is there any point in allowing more than one retry?
- * Don't see one at the moment, but easy to change the
- * test here if so.
- */
- if (!FILE_POSSIBLY_DELETED(errno) ||
- failures > 0)
- {
- Bitmapset *new_requests;
-
- /*
- * We need to merge these unsatisfied requests with
- * any others that have arrived since we started.
- */
- new_requests = entry->requests[forknum];
- entry->requests[forknum] =
- bms_join(new_requests, requests);
-
- errno = save_errno;
- ereport(data_sync_elevel(ERROR),
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\": %m",
- path)));
- }
- else
- ereport(DEBUG1,
- (errcode_for_file_access(),
- errmsg("could not fsync file \"%s\" but retrying: %m",
- path)));
- pfree(path);
-
- /*
- * Absorb incoming requests and check to see if a cancel
- * arrived for this relation fork.
- */
- AbsorbFsyncRequests();
- absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
-
- if (entry->canceled[forknum])
- break;
- } /* end retry loop */
- }
- bms_free(requests);
- }
-
- /*
- * We've finished everything that was requested before we started to
- * scan the entry. If no new requests have been inserted meanwhile,
- * remove the entry. Otherwise, update its cycle counter, as all the
- * requests now in it must have arrived during this cycle.
- */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- if (entry->requests[forknum] != NULL)
- break;
- }
- if (forknum <= MAX_FORKNUM)
- entry->cycle_ctr = mdsync_cycle_ctr;
- else
- {
- /* Okay to remove it */
- if (hash_search(pendingOpsTable, &entry->rnode,
- HASH_REMOVE, NULL) == NULL)
- elog(ERROR, "pendingOpsTable corrupted");
- }
- } /* end loop over hashtable entries */
-
- /* Return sync performance metrics for report at checkpoint end */
- CheckpointStats.ckpt_sync_rels = processed;
- CheckpointStats.ckpt_longest_sync = longest;
- CheckpointStats.ckpt_agg_sync_time = total_elapsed;
-
- /* Flag successful completion of mdsync */
- mdsync_in_progress = false;
-}
-
-/*
- * mdpreckpt() -- Do pre-checkpoint work
- *
- * To distinguish unlink requests that arrived before this checkpoint
- * started from those that arrived during the checkpoint, we use a cycle
- * counter similar to the one we use for fsync requests. That cycle
- * counter is incremented here.
- *
- * This must be called *before* the checkpoint REDO point is determined.
- * That ensures that we won't delete files too soon.
- *
- * Note that we can't do anything here that depends on the assumption
- * that the checkpoint will be completed.
- */
-void
-mdpreckpt(void)
-{
- /*
- * Any unlink requests arriving after this point will be assigned the next
- * cycle counter, and won't be unlinked until next checkpoint.
- */
- mdckpt_cycle_ctr++;
-}
-
-/*
- * mdpostckpt() -- Do post-checkpoint work
- *
- * Remove any lingering files that can now be safely removed.
- */
-void
-mdpostckpt(void)
-{
- int absorb_counter;
-
- absorb_counter = UNLINKS_PER_ABSORB;
- while (pendingUnlinks != NIL)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
- char *path;
-
- /*
- * New entries are appended to the end, so if the entry is new we've
- * reached the end of old entries.
- *
- * Note: if just the right number of consecutive checkpoints fail, we
- * could be fooled here by cycle_ctr wraparound. However, the only
- * consequence is that we'd delay unlinking for one more checkpoint,
- * which is perfectly tolerable.
- */
- if (entry->cycle_ctr == mdckpt_cycle_ctr)
- break;
-
- /* Unlink the file */
- path = relpathperm(entry->rnode, MAIN_FORKNUM);
- if (unlink(path) < 0)
- {
- /*
- * There's a race condition, when the database is dropped at the
- * same time that we process the pending unlink requests. If the
- * DROP DATABASE deletes the file before we do, we will get ENOENT
- * here. rmtree() also has to ignore ENOENT errors, to deal with
- * the possibility that we delete the file first.
- */
- if (errno != ENOENT)
- ereport(WARNING,
- (errcode_for_file_access(),
- errmsg("could not remove file \"%s\": %m", path)));
- }
- pfree(path);
-
- /* And remove the list entry */
- pendingUnlinks = list_delete_first(pendingUnlinks);
- pfree(entry);
-
- /*
- * As in mdsync, we don't want to stop absorbing fsync requests for a
- * long time when there are many deletions to be done. We can safely
- * call AbsorbFsyncRequests() at this point in the loop (note it might
- * try to delete list entries).
- */
- if (--absorb_counter <= 0)
- {
- AbsorbFsyncRequests();
- absorb_counter = UNLINKS_PER_ABSORB;
- }
- }
-}
-
/*
* register_dirty_segment() -- Mark a relation segment as needing fsync
*
@@ -1397,19 +897,15 @@ mdpostckpt(void)
static void
register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
{
+ FileTag tag;
+
+ INIT_MDFILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+
/* Temp relations should never be fsync'd */
Assert(!SmgrIsTemp(reln));
- if (pendingOpsTable)
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false /* retryOnError */ ))
{
- /* push it into local pending-ops table */
- RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
- }
- else
- {
- if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno))
- return; /* passed it off successfully */
-
ereport(DEBUG1,
(errmsg("could not forward fsync request because request queue is full")));
@@ -1423,254 +919,51 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/*
- * register_unlink() -- Schedule a file to be deleted after next checkpoint
+ * register_unlink_segment() -- Schedule a file to be deleted after next
+ *	checkpoint
- *
- * We don't bother passing in the fork number, because this is only used
- * with main forks.
- *
- * As with register_dirty_segment, this could involve either a local or
- * a remote pending-ops table.
*/
static void
-register_unlink(RelFileNodeBackend rnode)
+register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno)
{
+ FileTag tag;
+
+ INIT_MDFILETAG(tag, rnode.node, forknum, segno);
+
/* Should never be used with temp relations */
Assert(!RelFileNodeBackendIsTemp(rnode));
- if (pendingOpsTable)
- {
- /* push it into local pending-ops table */
- RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST);
- }
- else
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the request
- * message, we have to sleep and try again, because we can't simply
- * delete the file now. Ugly, but hopefully won't happen often.
- *
- * XXX should we just leave the file orphaned instead?
- */
- Assert(IsUnderPostmaster);
- while (!ForwardFsyncRequest(rnode.node, MAIN_FORKNUM,
- UNLINK_RELATION_REQUEST))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ RegisterSyncRequest(&tag, SYNC_UNLINK_REQUEST, true /* retryOnError */ );
}
/*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
- *
- * We stuff fsync requests into the local hash table for execution
- * during the checkpointer's next checkpoint. UNLINK requests go into a
- * separate linked list, however, because they get processed separately.
- *
- * The range of possible segment numbers is way less than the range of
- * BlockNumber, so we can reserve high values of segno for special purposes.
- * We define three:
- * - FORGET_RELATION_FSYNC means to cancel pending fsyncs for a relation,
- * either for one fork, or all forks if forknum is InvalidForkNumber
- * - FORGET_DATABASE_FSYNC means to cancel pending fsyncs for a whole database
- * - UNLINK_RELATION_REQUEST is a request to delete the file after the next
- * checkpoint.
- * Note also that we're assuming real segment numbers don't exceed INT_MAX.
- *
- * (Handling FORGET_DATABASE_FSYNC requests is a tad slow because the hash
- * table has to be searched linearly, but dropping a database is a pretty
- * heavyweight operation anyhow, so we'll live with it.)
+ * register_forget_request() -- forget any fsyncs for a relation fork's segment
*/
-void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+static void
+register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
+ BlockNumber segno)
{
- Assert(pendingOpsTable);
-
- if (segno == FORGET_RELATION_FSYNC)
- {
- /* Remove any pending requests for the relation (one or all forks) */
- PendingOperationEntry *entry;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_FIND,
- NULL);
- if (entry)
- {
- /*
- * We can't just delete the entry since mdsync could have an
- * active hashtable scan. Instead we delete the bitmapsets; this
- * is safe because of the way mdsync is coded. We also set the
- * "canceled" flags so that mdsync can tell that a cancel arrived
- * for the fork(s).
- */
- if (forknum == InvalidForkNumber)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- else
- {
- /* remove requests for single fork */
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
- else if (segno == FORGET_DATABASE_FSYNC)
- {
- /* Remove any pending requests for the entire database */
- HASH_SEQ_STATUS hstat;
- PendingOperationEntry *entry;
- ListCell *cell,
- *prev,
- *next;
-
- /* Remove fsync requests */
- hash_seq_init(&hstat, pendingOpsTable);
- while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
- {
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- /* remove requests for all forks */
- for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
- {
- bms_free(entry->requests[forknum]);
- entry->requests[forknum] = NULL;
- entry->canceled[forknum] = true;
- }
- }
- }
+ FileTag tag;
- /* Remove unlink requests */
- prev = NULL;
- for (cell = list_head(pendingUnlinks); cell; cell = next)
- {
- PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
-
- next = lnext(cell);
- if (entry->rnode.dbNode == rnode.dbNode)
- {
- pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
- pfree(entry);
- }
- else
- prev = cell;
- }
- }
- else if (segno == UNLINK_RELATION_REQUEST)
- {
- /* Unlink request: put it in the linked list */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingUnlinkEntry *entry;
-
- /* PendingUnlinkEntry doesn't store forknum, since it's always MAIN */
- Assert(forknum == MAIN_FORKNUM);
-
- entry = palloc(sizeof(PendingUnlinkEntry));
- entry->rnode = rnode;
- entry->cycle_ctr = mdckpt_cycle_ctr;
-
- pendingUnlinks = lappend(pendingUnlinks, entry);
-
- MemoryContextSwitchTo(oldcxt);
- }
- else
- {
- /* Normal case: enter a request to fsync this segment */
- MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
- PendingOperationEntry *entry;
- bool found;
-
- entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
- &rnode,
- HASH_ENTER,
- &found);
- /* if new entry, initialize it */
- if (!found)
- {
- entry->cycle_ctr = mdsync_cycle_ctr;
- MemSet(entry->requests, 0, sizeof(entry->requests));
- MemSet(entry->canceled, 0, sizeof(entry->canceled));
- }
-
- /*
- * NB: it's intentional that we don't change cycle_ctr if the entry
- * already exists. The cycle_ctr must represent the oldest fsync
- * request that could be in the entry.
- */
-
- entry->requests[forknum] = bms_add_member(entry->requests[forknum],
- (int) segno);
+ INIT_MDFILETAG(tag, rnode.node, forknum, segno);
- MemoryContextSwitchTo(oldcxt);
- }
-}
-
-/*
- * ForgetRelationFsyncRequests -- forget any fsyncs for a relation fork
- *
- * forknum == InvalidForkNumber means all forks, although this code doesn't
- * actually know that, since it's just forwarding the request elsewhere.
- */
-void
-ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
-{
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /*
- * Notify the checkpointer about it. If we fail to queue the cancel
- * message, we have to sleep and try again ... ugly, but hopefully
- * won't happen often.
- *
- * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
- * error would leave the no-longer-used file still present on disk,
- * which would be bad, so I'm inclined to assume that the checkpointer
- * will always empty the queue soon.
- */
- while (!ForwardFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
-
- /*
- * Note we don't wait for the checkpointer to actually absorb the
- * cancel message; see mdsync() for the implications.
- */
- }
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true /* retryOnError */ );
}
/*
- * ForgetDatabaseFsyncRequests -- forget any fsyncs and unlinks for a DB
+ * ForgetDatabaseSyncRequests -- forget any fsyncs and unlinks for a DB
*/
void
-ForgetDatabaseFsyncRequests(Oid dbid)
+ForgetDatabaseSyncRequests(Oid dbid)
{
+ FileTag tag;
RelFileNode rnode;
rnode.dbNode = dbid;
rnode.spcNode = 0;
rnode.relNode = 0;
- if (pendingOpsTable)
- {
- /* standalone backend or startup process: fsync state is local */
- RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
- }
- else if (IsUnderPostmaster)
- {
- /* see notes in ForgetRelationFsyncRequests */
- while (!ForwardFsyncRequest(rnode, InvalidForkNumber,
- FORGET_DATABASE_FSYNC))
- pg_usleep(10000L); /* 10 msec seems a good number */
- }
+ INIT_MDFILETAG(tag, rnode, InvalidForkNumber, InvalidBlockNumber);
+
+ RegisterSyncRequest(&tag, SYNC_FILTER_REQUEST, true /* retryOnError */ );
}
/*
@@ -1951,3 +1244,75 @@ _mdnblocks(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
/* note that this calculation will ignore any partial block at EOF */
return (BlockNumber) (len / BLCKSZ);
}
+
+/*
+ * Sync a file to disk, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, -1 on failure, with errno set.
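+ *
+ * This is md.c's implementation of the sync_syncfiletag callback, reached
+ * through sync.c's handler table for file tags carrying SYNC_HANDLER_MD.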
+ */
+int
+mdsyncfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ MdfdVec *v;
+ char *p;
+
+ /* Provide the path for informational messages. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+	/* Try to open the requested segment. */
+ v = _mdfd_getseg(reln, ftag->forknum, ftag->segno, false,
+ EXTENSION_RETURN_NULL);
+ if (v == NULL)
+ {
+ errno = ENOENT;
+ return -1;
+ }
+
+ /* Try to fsync the file. */
+ return FileSync(v->mdfd_vfd, WAIT_EVENT_DATA_FILE_SYNC);
+}
+
+/*
+ * Unlink a file, given a file tag. Write the path into an output
+ * buffer so the caller can use it in error messages.
+ *
+ * Return 0 on success, -1 on failure, with errno set.
+ */
+int
+mdunlinkfiletag(const FileTag *ftag, char *path)
+{
+ SMgrRelation reln = smgropen(ftag->rnode, InvalidBackendId);
+ char *p;
+
+ /* Compute the path. */
+ p = _mdfd_segpath(reln, ftag->forknum, ftag->segno);
+ strlcpy(path, p, MAXPGPATH);
+ pfree(p);
+
+ /* Try to unlink the file. */
+ return unlink(path);
+}
+
+/*
+ * Check if a given candidate request matches a given tag, when processing
+ * a SYNC_FILTER_REQUEST request. This will be called for all pending
+ * requests to find out whether to forget them.
+ */
+bool
+mdfiletagmatches(const FileTag *ftag, const FileTag *candidate)
+{
+ /*
+ * For now we only use filter requests as a way to drop all scheduled
+ * callbacks relating to a given database, when dropping the database.
+ * We'll return true for all candidates that have the same database OID as
+ * the ftag from the SYNC_FILTER_REQUEST request, so they're forgotten.
+ */
+ return ftag->rnode.dbNode == candidate->rnode.dbNode;
+}
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index f6de9df9e61..8191118b619 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -21,6 +21,7 @@
#include "lib/ilist.h"
#include "storage/bufmgr.h"
#include "storage/ipc.h"
+#include "storage/md.h"
#include "storage/smgr.h"
#include "utils/hsearch.h"
#include "utils/inval.h"
@@ -60,12 +61,8 @@ typedef struct f_smgr
void (*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
void (*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
- void (*smgr_pre_ckpt) (void); /* may be NULL */
- void (*smgr_sync) (void); /* may be NULL */
- void (*smgr_post_ckpt) (void); /* may be NULL */
} f_smgr;
-
static const f_smgr smgrsw[] = {
/* magnetic disk */
{
@@ -83,15 +80,11 @@ static const f_smgr smgrsw[] = {
.smgr_nblocks = mdnblocks,
.smgr_truncate = mdtruncate,
.smgr_immedsync = mdimmedsync,
- .smgr_pre_ckpt = mdpreckpt,
- .smgr_sync = mdsync,
- .smgr_post_ckpt = mdpostckpt
}
};
static const int NSmgr = lengthof(smgrsw);
-
/*
* Each backend has a hashtable that stores all extant SMgrRelation objects.
* In addition, "unowned" SMgrRelation objects are chained together in a list.
@@ -705,52 +698,6 @@ smgrimmedsync(SMgrRelation reln, ForkNumber forknum)
smgrsw[reln->smgr_which].smgr_immedsync(reln, forknum);
}
-
-/*
- * smgrpreckpt() -- Prepare for checkpoint.
- */
-void
-smgrpreckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_pre_ckpt)
- smgrsw[i].smgr_pre_ckpt();
- }
-}
-
-/*
- * smgrsync() -- Sync files to disk during checkpoint.
- */
-void
-smgrsync(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_sync)
- smgrsw[i].smgr_sync();
- }
-}
-
-/*
- * smgrpostckpt() -- Post-checkpoint cleanup.
- */
-void
-smgrpostckpt(void)
-{
- int i;
-
- for (i = 0; i < NSmgr; i++)
- {
- if (smgrsw[i].smgr_post_ckpt)
- smgrsw[i].smgr_post_ckpt();
- }
-}
-
/*
* AtEOXact_SMgr
*
diff --git a/src/backend/storage/sync/Makefile b/src/backend/storage/sync/Makefile
new file mode 100644
index 00000000000..cfc60cadb4c
--- /dev/null
+++ b/src/backend/storage/sync/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+# Makefile for storage/sync
+#
+# IDENTIFICATION
+# src/backend/storage/sync/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/storage/sync
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = sync.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
new file mode 100644
index 00000000000..0c5712b62bd
--- /dev/null
+++ b/src/backend/storage/sync/sync.c
@@ -0,0 +1,598 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.c
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/storage/sync/sync.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include <unistd.h>
+#include <fcntl.h>
+#include <sys/file.h>
+
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "access/xlogutils.h"
+#include "access/xlog.h"
+#include "commands/tablespace.h"
+#include "portability/instr_time.h"
+#include "postmaster/bgwriter.h"
+#include "storage/bufmgr.h"
+#include "storage/ipc.h"
+#include "storage/md.h"
+#include "utils/hsearch.h"
+#include "utils/memutils.h"
+#include "utils/inval.h"
+
+/*
+ * In some contexts (currently, standalone backends and the checkpointer)
+ * we keep track of pending fsync operations: we need to remember all relation
+ * segments that have been written since the last checkpoint, so that we can
+ * fsync them down to disk before completing the next checkpoint. This hash
+ * table remembers the pending operations. We use a hash table mostly as
+ * a convenient way of merging duplicate requests.
+ *
+ * We use a similar mechanism to remember no-longer-needed files that can
+ * be deleted after the next checkpoint, but we use a linked list instead of
+ * a hash table, because we don't expect there to be any duplicate requests.
+ *
+ * These mechanisms are only used for non-temp relations; we never fsync
+ * temp rels, nor do we need to postpone their deletion (see comments in
+ * mdunlink).
+ *
+ * (Regular backends do not track pending operations locally, but forward
+ * them to the checkpointer.)
+ */
+typedef uint16 CycleCtr; /* can be any convenient integer size */
+
+typedef struct
+{
+ FileTag tag; /* identifies handler and file */
+ CycleCtr cycle_ctr; /* sync_cycle_ctr of oldest request */
+	bool		canceled;		/* true if request has been canceled */
+} PendingFsyncEntry;
+
+typedef struct
+{
+ FileTag tag; /* identifies handler and file */
+ CycleCtr cycle_ctr; /* checkpoint_cycle_ctr when request was made */
+} PendingUnlinkEntry;
+
+static HTAB *pendingOps = NULL;
+static List *pendingUnlinks = NIL;
+static MemoryContext pendingOpsCxt; /* context for the above */
+
+static CycleCtr sync_cycle_ctr = 0;
+static CycleCtr checkpoint_cycle_ctr = 0;
+
+/* Intervals for calling AbsorbSyncRequests */
+#define FSYNCS_PER_ABSORB 10
+#define UNLINKS_PER_ABSORB 10
+
+/*
+ * Function pointers for handling sync and unlink requests.
+ */
+typedef struct SyncOps
+{
+ int (*sync_syncfiletag) (const FileTag *ftag, char *path);
+ int (*sync_unlinkfiletag) (const FileTag *ftag, char *path);
+ bool (*sync_filetagmatches) (const FileTag *ftag,
+ const FileTag *candidate);
+} SyncOps;
+
+static const SyncOps syncsw[] = {
+ /* magnetic disk */
+ {
+ .sync_syncfiletag = mdsyncfiletag,
+ .sync_unlinkfiletag = mdunlinkfiletag,
+ .sync_filetagmatches = mdfiletagmatches
+ }
+};
+
+/*
+ * Initialize data structures for the file sync tracking.
+ */
+void
+InitSync(void)
+{
+ /*
+ * Create pending-operations hashtable if we need it. Currently, we need
+ * it if we are standalone (not under a postmaster) or if we are a startup
+ * or checkpointer auxiliary process.
+ */
+ if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+ {
+ HASHCTL hash_ctl;
+
+ /*
+ * XXX: The checkpointer needs to add entries to the pending ops table
+ * when absorbing fsync requests. That is done within a critical
+ * section, which isn't usually allowed, but we make an exception. It
+ * means that there's a theoretical possibility that you run out of
+ * memory while absorbing fsync requests, which leads to a PANIC.
+ * Fortunately the hash table is small so that's unlikely to happen in
+ * practice.
+ */
+ pendingOpsCxt = AllocSetContextCreate(TopMemoryContext,
+ "Pending ops context",
+ ALLOCSET_DEFAULT_SIZES);
+ MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+ MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+ hash_ctl.keysize = sizeof(FileTag);
+ hash_ctl.entrysize = sizeof(PendingFsyncEntry);
+ hash_ctl.hcxt = pendingOpsCxt;
+ pendingOps = hash_create("Pending Ops Table",
+ 100L,
+ &hash_ctl,
+ HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+ pendingUnlinks = NIL;
+ }
+}
+
+/*
+ * SyncPreCheckpoint() -- Do pre-checkpoint work
+ *
+ * To distinguish unlink requests that arrived before this checkpoint
+ * started from those that arrived during the checkpoint, we use a cycle
+ * counter similar to the one we use for fsync requests. That cycle
+ * counter is incremented here.
+ *
+ * This must be called *before* the checkpoint REDO point is determined.
+ * That ensures that we won't delete files too soon.
+ *
+ * Note that we can't do anything here that depends on the assumption
+ * that the checkpoint will be completed.
+ */
+void
+SyncPreCheckpoint(void)
+{
+ /*
+ * Any unlink requests arriving after this point will be assigned the next
+ * cycle counter, and won't be unlinked until next checkpoint.
+ */
+ checkpoint_cycle_ctr++;
+}
+
+/*
+ * SyncPostCheckpoint() -- Do post-checkpoint work
+ *
+ * Remove any lingering files that can now be safely removed.
+ */
+void
+SyncPostCheckpoint(void)
+{
+ int absorb_counter;
+
+ absorb_counter = UNLINKS_PER_ABSORB;
+ while (pendingUnlinks != NIL)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) linitial(pendingUnlinks);
+ char path[MAXPGPATH];
+
+ /*
+ * New entries are appended to the end, so if the entry is new we've
+ * reached the end of old entries.
+ *
+ * Note: if just the right number of consecutive checkpoints fail, we
+ * could be fooled here by cycle_ctr wraparound. However, the only
+ * consequence is that we'd delay unlinking for one more checkpoint,
+ * which is perfectly tolerable.
+ */
+ if (entry->cycle_ctr == checkpoint_cycle_ctr)
+ break;
+
+ /* Unlink the file */
+ if (syncsw[entry->tag.handler].sync_unlinkfiletag(&entry->tag,
+ path) < 0)
+ {
+ /*
+ * There's a race condition, when the database is dropped at the
+ * same time that we process the pending unlink requests. If the
+ * DROP DATABASE deletes the file before we do, we will get ENOENT
+ * here. rmtree() also has to ignore ENOENT errors, to deal with
+ * the possibility that we delete the file first.
+ */
+ if (errno != ENOENT)
+ ereport(WARNING,
+ (errcode_for_file_access(),
+ errmsg("could not remove file \"%s\": %m", path)));
+ }
+
+ /* And remove the list entry */
+ pendingUnlinks = list_delete_first(pendingUnlinks);
+ pfree(entry);
+
+ /*
+	 * As in ProcessSyncRequests, we don't want to stop absorbing fsync
+	 * requests for a long time when there are many deletions to be done.
+	 * We can safely call AbsorbSyncRequests() at this point in the loop
+	 * (note it might try to delete list entries).
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = UNLINKS_PER_ABSORB;
+ }
+ }
+}
+
+/*
+ * ProcessSyncRequests() -- Process queued fsync requests.
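+ *
+ * This is called at checkpoint time by processes that keep a local
+ * pendingOps table (normally the checkpointer; also standalone backends
+ * and the startup process, per InitSync()).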
+ */
+void
+ProcessSyncRequests(void)
+{
+ static bool sync_in_progress = false;
+
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ int absorb_counter;
+
+ /* Statistics on sync times */
+ int processed = 0;
+ instr_time sync_start,
+ sync_end,
+ sync_diff;
+ uint64 elapsed;
+ uint64 longest = 0;
+ uint64 total_elapsed = 0;
+
+ /*
+ * This is only called during checkpoints, and checkpoints should only
+ * occur in processes that have created a pendingOps.
+ */
+ if (!pendingOps)
+ elog(ERROR, "cannot sync without a pendingOps table");
+
+ /*
+ * If we are in the checkpointer, the sync had better include all fsync
+ * requests that were queued by backends up to this point. The tightest
+ * race condition that could occur is that a buffer that must be written
+ * and fsync'd for the checkpoint could have been dumped by a backend just
+ * before it was visited by BufferSync(). We know the backend will have
+ * queued an fsync request before clearing the buffer's dirtybit, so we
+ * are safe as long as we do an Absorb after completing BufferSync().
+ */
+ AbsorbSyncRequests();
+
+ /*
+ * To avoid excess fsync'ing (in the worst case, maybe a never-terminating
+ * checkpoint), we want to ignore fsync requests that are entered into the
+ * hashtable after this point --- they should be processed next time,
+ * instead. We use sync_cycle_ctr to tell old entries apart from new
+ * ones: new ones will have cycle_ctr equal to the incremented value of
+ * sync_cycle_ctr.
+ *
+ * In normal circumstances, all entries present in the table at this point
+ * will have cycle_ctr exactly equal to the current (about to be old)
+ * value of sync_cycle_ctr. However, if we fail partway through the
+ * fsync'ing loop, then older values of cycle_ctr might remain when we
+ * come back here to try again. Repeated checkpoint failures would
+ * eventually wrap the counter around to the point where an old entry
+ * might appear new, causing us to skip it, possibly allowing a checkpoint
+ * to succeed that should not have. To forestall wraparound, any time the
+	 * previous ProcessSyncRequests() failed to complete, run through the
+ * table and forcibly set cycle_ctr = sync_cycle_ctr.
+ *
+ * Think not to merge this loop with the main loop, as the problem is
+ * exactly that that loop may fail before having visited all the entries.
+ * From a performance point of view it doesn't matter anyway, as this path
+ * will never be taken in a system that's functioning normally.
+ */
+ if (sync_in_progress)
+ {
+ /* prior try failed, so update any stale cycle_ctr values */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ }
+ }
+
+ /* Advance counter so that new hashtable entries are distinguishable */
+ sync_cycle_ctr++;
+
+ /* Set flag to detect failure if we don't reach the end of the loop */
+ sync_in_progress = true;
+
+ /* Now scan the hashtable for fsync requests to process */
+ absorb_counter = FSYNCS_PER_ABSORB;
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ int failures;
+
+ /*
+		 * If the entry is new then don't process it this time; it must have
+		 * been inserted after we advanced the cycle counter.  Note "continue"
+		 * bypasses the hash-remove call at the bottom of the loop.
+ */
+ if (entry->cycle_ctr == sync_cycle_ctr)
+ continue;
+
+ /* Else assert we haven't missed it */
+ Assert((CycleCtr) (entry->cycle_ctr + 1) == sync_cycle_ctr);
+
+ /*
+ * If in checkpointer, we want to absorb pending requests every so
+ * often to prevent overflow of the fsync request queue. It is
+ * unspecified whether newly-added entries will be visited by
+ * hash_seq_search, but we don't care since we don't need to process
+ * them anyway.
+ */
+ if (--absorb_counter <= 0)
+ {
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB;
+ }
+
+ /*
+ * The fsync table could contain requests to fsync segments that have
+ * been deleted (unlinked) by the time we get to them. Rather than
+ * just hoping an ENOENT (or EACCES on Windows) error can be ignored,
+ * what we do on error is absorb pending requests and then retry.
+ * Since mdunlink() queues a "cancel" message before actually
+ * unlinking, the fsync request is guaranteed to be marked canceled
+ * after the absorb if it really was this case. DROP DATABASE likewise
+		 * has to tell us to forget fsync requests before it starts deletions.
+		 *
+		 * If fsync is off then we don't have to bother opening the file at
+		 * all; the retry loop body is skipped entirely, but we still fall
+		 * through to remove the entry below.  (We delay checking enableFsync
+		 * until this point so that changing fsync on the fly behaves
+		 * sensibly.)
+		 */
+		for (failures = 0; !entry->canceled && enableFsync; failures++)
+ {
+ char path[MAXPGPATH];
+
+ INSTR_TIME_SET_CURRENT(sync_start);
+ if (syncsw[entry->tag.handler].sync_syncfiletag(&entry->tag,
+ path) == 0)
+ {
+ /* Success; update statistics about sync timing */
+ INSTR_TIME_SET_CURRENT(sync_end);
+ sync_diff = sync_end;
+ INSTR_TIME_SUBTRACT(sync_diff, sync_start);
+ elapsed = INSTR_TIME_GET_MICROSEC(sync_diff);
+ if (elapsed > longest)
+ longest = elapsed;
+ total_elapsed += elapsed;
+ processed++;
+
+ if (log_checkpoints)
+ elog(DEBUG1, "checkpoint sync: number=%d file=%s time=%.3f msec",
+ processed,
+ path,
+ (double) elapsed / 1000);
+
+ break; /* out of retry loop */
+ }
+
+ /*
+ * It is possible that the relation has been dropped or truncated
+ * since the fsync request was entered. Therefore, allow ENOENT,
+ * but only if we didn't fail already on this file.
+ */
+ if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
+ ereport(data_sync_elevel(ERROR),
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\": %m",
+ path)));
+ else
+ ereport(DEBUG1,
+ (errcode_for_file_access(),
+ errmsg("could not fsync file \"%s\" but retrying: %m",
+ path)));
+
+ /*
+ * Absorb incoming requests and check to see if a cancel arrived
+ * for this relation fork.
+ */
+ AbsorbSyncRequests();
+ absorb_counter = FSYNCS_PER_ABSORB; /* might as well... */
+ } /* end retry loop */
+
+ /* We are done with this entry, remove it */
+ if (hash_search(pendingOps, &entry->tag, HASH_REMOVE, NULL) == NULL)
+ elog(ERROR, "pendingOps corrupted");
+ } /* end loop over hashtable entries */
+
+ /* Return sync performance metrics for report at checkpoint end */
+ CheckpointStats.ckpt_sync_rels = processed;
+ CheckpointStats.ckpt_longest_sync = longest;
+ CheckpointStats.ckpt_agg_sync_time = total_elapsed;
+
+ /* Flag successful completion of ProcessSyncRequests */
+ sync_in_progress = false;
+}
+
+/*
+ * RememberSyncRequest() -- callback from checkpointer side of sync request
+ *
+ * We stuff fsync requests into the local hash table for execution
+ * during the checkpointer's next checkpoint. UNLINK requests go into a
+ * separate linked list, however, because they get processed separately.
+ *
+ * See sync.h for more information on the types of sync requests supported.
+ */
+void
+RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
+{
+ Assert(pendingOps);
+
+ if (type == SYNC_FORGET_REQUEST)
+ {
+ PendingFsyncEntry *entry;
+
+ /* Cancel previously entered request */
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *) ftag,
+ HASH_FIND,
+ NULL);
+ if (entry != NULL)
+ entry->canceled = true;
+ }
+ else if (type == SYNC_FILTER_REQUEST)
+ {
+ HASH_SEQ_STATUS hstat;
+ PendingFsyncEntry *entry;
+ ListCell *cell,
+ *prev,
+ *next;
+
+ /* Cancel matching fsync requests */
+ hash_seq_init(&hstat, pendingOps);
+ while ((entry = (PendingFsyncEntry *) hash_seq_search(&hstat)) != NULL)
+ {
+ if (entry->tag.handler == ftag->handler &&
+ syncsw[ftag->handler].sync_filetagmatches(ftag, &entry->tag))
+ entry->canceled = true;
+ }
+
+ /* Remove matching unlink requests */
+ prev = NULL;
+ for (cell = list_head(pendingUnlinks); cell; cell = next)
+ {
+ PendingUnlinkEntry *entry = (PendingUnlinkEntry *) lfirst(cell);
+
+ next = lnext(cell);
+ if (entry->tag.handler == ftag->handler &&
+ syncsw[ftag->handler].sync_filetagmatches(ftag, &entry->tag))
+ {
+ pendingUnlinks = list_delete_cell(pendingUnlinks, cell, prev);
+ pfree(entry);
+ }
+ else
+ prev = cell;
+ }
+ }
+ else if (type == SYNC_UNLINK_REQUEST)
+ {
+ /* Unlink request: put it in the linked list */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingUnlinkEntry *entry;
+
+ entry = palloc(sizeof(PendingUnlinkEntry));
+ entry->tag = *ftag;
+ entry->cycle_ctr = checkpoint_cycle_ctr;
+
+ pendingUnlinks = lappend(pendingUnlinks, entry);
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+ else
+ {
+ /* Normal case: enter a request to fsync this segment */
+ MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+ PendingFsyncEntry *entry;
+ bool found;
+
+ Assert(type == SYNC_REQUEST);
+
+ entry = (PendingFsyncEntry *) hash_search(pendingOps,
+ (void *) ftag,
+ HASH_ENTER,
+ &found);
+		/* if new entry, or one whose request was canceled, initialize it */
+		if (!found || entry->canceled)
+ {
+ entry->cycle_ctr = sync_cycle_ctr;
+ entry->canceled = false;
+ }
+
+ /*
+ * NB: it's intentional that we don't change cycle_ctr if the entry
+ * already exists. The cycle_ctr must represent the oldest fsync
+ * request that could be in the entry.
+ */
+
+ MemoryContextSwitchTo(oldcxt);
+ }
+}
+
+/*
+ * Register the sync request locally, or forward it to the checkpointer.
+ *
+ * If retryOnError is true, we'll keep trying if there is no space in the
+ * queue. Return true if we succeeded, or false if there wasn't space.
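+ *
+ * A typical caller builds a tag with its handler's initializer macro and
+ * hands it off, falling back to doing the work itself if the queue is full
+ * and retryOnError is false, e.g. as md.c's register_dirty_segment() does:
+ *
+ *     FileTag tag;
+ *
+ *     INIT_MDFILETAG(tag, reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+ *     if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
+ *         ... fsync the file ourselves ...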
+ */
+bool
+RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ bool retryOnError)
+{
+ bool ret;
+
+ if (pendingOps != NULL)
+ {
+ /* standalone backend or startup process: fsync state is local */
+ RememberSyncRequest(ftag, type);
+ return true;
+ }
+
+ for (;;)
+ {
+ /*
+		 * Notify the checkpointer about it.  If we fail to queue the request
+		 * message, we have to sleep and try again ... ugly, but hopefully
+ * won't happen often.
+ *
+ * XXX should we CHECK_FOR_INTERRUPTS in this loop? Escaping with an
+ * error in the case of SYNC_UNLINK_REQUEST would leave the
+ * no-longer-used file still present on disk, which would be bad, so
+ * I'm inclined to assume that the checkpointer will always empty the
+ * queue soon.
+ */
+ ret = ForwardSyncRequest(ftag, type);
+
+ /*
+ * If we are successful in queueing the request, or we failed and were
+ * instructed not to retry on error, break.
+ */
+		if (ret || !retryOnError)
+ break;
+
+ pg_usleep(10000L);
+ }
+
+ return ret;
+}
+
+/*
+ * In archive recovery, we rely on checkpointer to do fsyncs, but we will have
+ * already created the pendingOps during initialization of the startup
+ * process. Calling this function drops the local pendingOps so that
+ * subsequent requests will be forwarded to checkpointer.
+ */
+void
+EnableSyncRequestForwarding(void)
+{
+ /* Perform any pending fsyncs we may have queued up, then drop table */
+ if (pendingOps)
+ {
+ ProcessSyncRequests();
+ hash_destroy(pendingOps);
+ }
+ pendingOps = NULL;
+
+ /*
+ * We should not have any pending unlink requests, since mdunlink doesn't
+ * queue unlink requests when isRedo.
+ */
+ Assert(pendingUnlinks == NIL);
+}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 752010ed276..1c2a99c9c8c 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -51,6 +51,7 @@
#include "storage/proc.h"
#include "storage/sinvaladt.h"
#include "storage/smgr.h"
+#include "storage/sync.h"
#include "tcop/tcopprot.h"
#include "utils/acl.h"
#include "utils/fmgroids.h"
@@ -555,6 +556,7 @@ BaseInit(void)
/* Do local initialization of file, storage and buffer managers */
InitFileAccess();
+ InitSync();
smgrinit();
InitBufferPoolAccess();
}
diff --git a/src/include/postmaster/bgwriter.h b/src/include/postmaster/bgwriter.h
index 53b8f5fe3cb..630366f49ef 100644
--- a/src/include/postmaster/bgwriter.h
+++ b/src/include/postmaster/bgwriter.h
@@ -17,6 +17,8 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
/* GUC options */
@@ -31,9 +33,9 @@ extern void CheckpointerMain(void) pg_attribute_noreturn();
extern void RequestCheckpoint(int flags);
extern void CheckpointWriteDelay(int flags, double progress);
-extern bool ForwardFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void AbsorbFsyncRequests(void);
+extern bool ForwardSyncRequest(const FileTag *ftag, SyncRequestType type);
+
+extern void AbsorbSyncRequests(void);
extern Size CheckpointerShmemSize(void);
extern void CheckpointerShmemInit(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb5..40f46b871d7 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -54,6 +54,18 @@ extern PGDLLIMPORT bool data_sync_retry;
*/
extern int max_safe_fds;
+/*
+ * On Windows, we have to interpret EACCES as possibly meaning the same as
+ * ENOENT, because if a file is unlinked-but-not-yet-gone on that platform,
+ * that's what you get. Ugh. This code is designed so that we don't
+ * actually believe these cases are okay without further evidence (namely,
+ * a pending fsync request getting canceled ... see ProcessSyncRequests).
+ */
+#ifndef WIN32
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT)
+#else
+#define FILE_POSSIBLY_DELETED(err) ((err) == ENOENT || (err) == EACCES)
+#endif
/*
* prototypes for functions in fd.c
diff --git a/src/include/storage/md.h b/src/include/storage/md.h
new file mode 100644
index 00000000000..a6758a10dcb
--- /dev/null
+++ b/src/include/storage/md.h
@@ -0,0 +1,51 @@
+/*-------------------------------------------------------------------------
+ *
+ * md.h
+ * magnetic disk storage manager public interface declarations.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/md.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef MD_H
+#define MD_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+#include "storage/smgr.h"
+#include "storage/sync.h"
+
+/* md storage manager functionality */
+extern void mdinit(void);
+extern void mdclose(SMgrRelation reln, ForkNumber forknum);
+extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
+extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
+extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
+extern void mdextend(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum);
+extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+ char *buffer);
+extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
+extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
+ BlockNumber nblocks);
+extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
+
+extern void ForgetDatabaseSyncRequests(Oid dbid);
+extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
+
+/* md sync callbacks */
+extern int mdsyncfiletag(const FileTag *ftag, char *path);
+extern int mdunlinkfiletag(const FileTag *ftag, char *path);
+extern bool mdfiletagmatches(const FileTag *ftag, const FileTag *candidate);
+
+#endif /* MD_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 8e982738789..770193e285e 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -18,7 +18,6 @@
#include "storage/block.h"
#include "storage/relfilenode.h"
-
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
* cached file handles. An SMgrRelation is created (if not already present)
@@ -106,43 +105,6 @@ extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
BlockNumber nblocks);
extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void smgrpreckpt(void);
-extern void smgrsync(void);
-extern void smgrpostckpt(void);
extern void AtEOXact_SMgr(void);
-
-/* internals: move me elsewhere -- ay 7/94 */
-
-/* in md.c */
-extern void mdinit(void);
-extern void mdclose(SMgrRelation reln, ForkNumber forknum);
-extern void mdcreate(SMgrRelation reln, ForkNumber forknum, bool isRedo);
-extern bool mdexists(SMgrRelation reln, ForkNumber forknum);
-extern void mdunlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo);
-extern void mdextend(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdprefetch(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum);
-extern void mdread(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
- char *buffer);
-extern void mdwrite(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, char *buffer, bool skipFsync);
-extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
- BlockNumber blocknum, BlockNumber nblocks);
-extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
-extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
- BlockNumber nblocks);
-extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
-extern void mdpreckpt(void);
-extern void mdsync(void);
-extern void mdpostckpt(void);
-
-extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
-extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
-extern void ForgetDatabaseFsyncRequests(Oid dbid);
-extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
-
#endif /* SMGR_H */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
new file mode 100644
index 00000000000..d063d826724
--- /dev/null
+++ b/src/include/storage/sync.h
@@ -0,0 +1,64 @@
+/*-------------------------------------------------------------------------
+ *
+ * sync.h
+ * File synchronization management code.
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/sync.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SYNC_H
+#define SYNC_H
+
+#include "storage/block.h"
+#include "storage/relfilenode.h"
+
+/*
+ * Type of sync request. These are used to manage the set of pending
+ * requests to call the handler's sync or unlink functions at the next
+ * checkpoint.
+ */
+typedef enum SyncRequestType
+{
+ SYNC_REQUEST, /* schedule a call of sync function */
+ SYNC_UNLINK_REQUEST, /* schedule a call of unlink function */
+ SYNC_FORGET_REQUEST, /* forget all calls for a tag */
+ SYNC_FILTER_REQUEST /* forget all calls satisfying match fn */
+} SyncRequestType;
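+
+/*
+ * As examples of the latter two: md.c sends a SYNC_FORGET_REQUEST for a
+ * segment just before unlinking it, and DROP DATABASE sends a
+ * SYNC_FILTER_REQUEST to cancel everything pending for the doomed database
+ * (see ForgetDatabaseSyncRequests()).
+ */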
+
+/*
+ * Which set of functions to use to handle a given request. See the function
+ * table in sync.c.
+ */
+typedef enum SyncRequestHandler
+{
+ SYNC_HANDLER_MD = 0 /* md smgr */
+} SyncRequestHandler;
+
+/*
+ * A tag identifying a file. Currently it has the members required for md.c's
+ * usage, but sync.c has no knowledge of the internal structure, and it is
+ * liable to change as required by future handlers.
+ */
+typedef struct FileTag
+{
+	int16		handler;		/* SyncRequestHandler value, saving space */
+ int16 forknum; /* ForkNumber, saving space */
+ RelFileNode rnode;
+ BlockNumber segno;
+} FileTag;
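+
+/*
+ * File tags are used directly as hash table keys, so all bytes, including
+ * any alignment padding, must be zeroed before use; handler code is
+ * expected to do that in its tag-initializing macros (md.c's
+ * INIT_MDFILETAG, for example) by memset()ing the whole struct first.
+ */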
+
+/* sync forward declarations */
+extern void InitSync(void);
+extern void SyncPreCheckpoint(void);
+extern void SyncPostCheckpoint(void);
+extern void ProcessSyncRequests(void);
+extern void RememberSyncRequest(const FileTag *ftag, SyncRequestType type);
+extern void EnableSyncRequestForwarding(void);
+extern bool RegisterSyncRequest(const FileTag *ftag, SyncRequestType type,
+ bool retryOnError);
+
+#endif /* SYNC_H */
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index f31929664ac..e09f9353ed2 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -651,6 +651,7 @@ File
FileFdwExecutionState
FileFdwPlanState
FileNameMap
+FileTag
FindSplitData
FixedParallelExecutorState
FixedParallelState
@@ -1700,7 +1701,7 @@ PathKeysComparison
PathTarget
Pattern_Prefix_Status
Pattern_Type
-PendingOperationEntry
+PendingFsyncEntry
PendingRelDelete
PendingUnlinkEntry
PendingWriteback
@@ -2276,7 +2277,10 @@ Subscription
SubscriptionInfo
SubscriptionRelState
Syn
+SyncOps
SyncRepConfigData
+SyncRequestHandler
+SyncRequestType
SysScanDesc
SyscacheCallbackFunction
SystemRowsSamplerData
--
2.21.0
0002-Use-the-fsync-queue-for-SLRU-files-v16.patch (application/octet-stream)
From 4be3852fe4c6edbef510bbdee2fad297d08cc4ac Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 3 Apr 2019 22:15:19 +1300
Subject: [PATCH 2/2] Use the fsync queue for SLRU files.
Previously, we called fsync() after writing out each SLRU page. Use the
same mechanism for deferring and handing off fsync work to the
checkpointer that md.c uses.
This is a proof-of-concept only for now.
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 24 +++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 172 insertions(+), 44 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 3bd55fbdd33..a3d3f9a304e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pgstat.h"
#include "pg_trace.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -699,7 +700,8 @@ CLOGShmemInit(void)
{
ClogCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(ClogCtl, "clog", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- CLogControlLock, "pg_xact", LWTRANCHE_CLOG_BUFFERS);
+ CLogControlLock, "pg_xact", LWTRANCHE_CLOG_BUFFERS,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1041,3 +1043,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(ClogCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 8162f884bd1..e35480d89ec 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -494,7 +494,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0,
CommitTsControlLock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFERS);
+ LWTRANCHE_COMMITTS_BUFFERS,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1022,3 +1023,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 763b9997071..bf2e9886032 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1829,11 +1829,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"multixact_offset", NUM_MXACTOFFSET_BUFFERS, 0,
MultiXactOffsetControlLock, "pg_multixact/offsets",
- LWTRANCHE_MXACTOFFSET_BUFFERS);
+ LWTRANCHE_MXACTOFFSET_BUFFERS,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"multixact_member", NUM_MXACTMEMBER_BUFFERS, 0,
MultiXactMemberControlLock, "pg_multixact/members",
- LWTRANCHE_MXACTMEMBER_BUFFERS);
+ LWTRANCHE_MXACTMEMBER_BUFFERS,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3392,3 +3394,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 974d42fc866..8d5ceab134c 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -81,6 +81,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -163,7 +175,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -247,7 +260,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -862,23 +875,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd))
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) < 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd))
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1140,21 +1161,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]))
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]))
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1270,6 +1281,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1313,6 +1325,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1411,3 +1434,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index e667fd02385..aa71a6ddbc6 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransControlLock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFERS, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 5a7ee0de4cf..9c68358da6a 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -479,9 +479,8 @@ AsyncShmemInit(void)
*/
AsyncCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(AsyncCtl, "async", NUM_ASYNC_BUFFERS, 0,
- AsyncCtlLock, "pg_notify", LWTRANCHE_ASYNC_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- AsyncCtl->do_fsync = false;
+ AsyncCtlLock, "pg_notify", LWTRANCHE_ASYNC_BUFFERS,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 4e4d04bae37..ab41194930f 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -824,9 +824,7 @@ OldSerXidInit(void)
OldSerXidSlruCtl->PagePrecedes = OldSerXidPagePrecedesLogically;
SimpleLruInit(OldSerXidSlruCtl, "oldserxid",
NUM_OLDSERXID_BUFFERS, 0, OldSerXidLock, "pg_serial",
- LWTRANCHE_OLDSERXID_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- OldSerXidSlruCtl->do_fsync = false;
+ LWTRANCHE_OLDSERXID_BUFFERS, SYNC_HANDLER_NONE);
/*
* Create or attach to the OldSerXidControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 0c5712b62bd..dad97a78c1f 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -20,6 +20,9 @@
#include "miscadmin.h"
#include "pgstat.h"
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlogutils.h"
#include "access/xlog.h"
#include "commands/tablespace.h"
@@ -90,13 +93,32 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
{
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 57ef9fe858e..f55391c73e3 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 123c91128b8..1f32196873d 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 83ae5b6b795..05dcbc8ae35 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index b6e66f56a0a..deccde4cc44 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -115,10 +116,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -139,7 +140,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -155,6 +157,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index d063d826724..3d5327fea54 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -35,7 +35,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.21.0
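To make the control flow above concrete: each per-SLRU handler listed in
the syncsw table is a thin wrapper over the common slrusyncfiletag()
helper, and sync.c dispatches to it by indexing the table with the
handler number stored in the request. A minimal sketch of what clog.c's
wrapper could look like (its definition is not included in the excerpt
above; ClogCtl is assumed to be clog.c's existing SlruCtl):

/* Sketch only, assuming clog.c's existing ClogCtl; not part of the diff. */
int
clogsyncfiletag(const FileTag *ftag, char *path)
{
	/* Forward to the common SLRU sync implementation with our SlruCtl. */
	return slrusyncfiletag(ClogCtl, ftag, path);
}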
On Thu, Apr 04, 2019 at 02:01:14PM +1300, Thomas Munro wrote:
On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
... Perhaps
that is an argument for putting the sync handler number *inside* the
FileTag, since we currently intend to do that with smgr IDs in
BufferTag (stealing space from ForkNumber).

Here is a version like that. I like it better this way, and the extra
space can be clawed back by using 16-bit types to hold the fork number
and sync handler number.
+typedef struct FileTag
+{
+ int16 handler; /* SyncRequestHandler value, saving space */
+ int16 forknum; /* ForkNumber, saving space */
+ RelFileNode rnode;
+ BlockNumber segno;
+} FileTag;
Definitely makes sense. v16 looks good to me.
Thanks!
--
Shawn Debnath
Amazon Web Services (AWS)
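A quick aside on the space saving, since it is easy to verify: with the
two int16 fields the tag packs into 20 bytes on common ABIs, versus 24
with plain ints. A standalone sketch (the typedefs below are stand-ins
for the PostgreSQL ones, only so it compiles on its own):

#include <stdint.h>
#include <stdio.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

typedef struct RelFileNode
{
	Oid spcNode; /* tablespace */
	Oid dbNode; /* database */
	Oid relNode; /* relation */
} RelFileNode;

typedef struct FileTag
{
	int16_t handler; /* SyncRequestHandler value, saving space */
	int16_t forknum; /* ForkNumber, saving space */
	RelFileNode rnode;
	BlockNumber segno;
} FileTag;

int
main(void)
{
	/* 2 + 2 + 12 + 4 = 20 bytes, with no padding on common ABIs. */
	printf("sizeof(FileTag) = %zu\n", sizeof(FileTag));
	return 0;
}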
On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote:
On Thu, Apr 04, 2019 at 02:01:14PM +1300, Thomas Munro wrote:
On Thu, Apr 4, 2019 at 11:39 AM Thomas Munro <thomas.munro@gmail.com> wrote:
... Perhaps
that is an argument for putting the sync handler number *inside* the
FileTag, since we currently intend to do that with smgr IDs in
BufferTag (stealing space from ForkNumber).

Here is a version like that. I like it better this way, and the extra
space can be clawed back by using 16-bit types to hold the fork number
and sync handler number.

+typedef struct FileTag
+{
+ int16 handler; /* SyncRequestHandler value, saving space */
+ int16 forknum; /* ForkNumber, saving space */
+ RelFileNode rnode;
+ BlockNumber segno;
+} FileTag;
Seems odd to me to use BlockNumber for segno.
On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote:
On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote:
+typedef struct FileTag
+{
+ int16 handler; /* SyncRequestHandler value, saving space */
+ int16 forknum; /* ForkNumber, saving space */
+ RelFileNode rnode;
+ BlockNumber segno;
+} FileTag;

Seems odd to me to use BlockNumber for segno.
That is a tradition in md.c code. I had a new typedef SegmentNumber
in all sync.{c,h} stuff in an earlier version, but had trouble
figuring out where to define it...
--
Thomas Munro
https://enterprisedb.com
On Thu, Apr 04, 2019 at 05:39:14PM +1300, Thomas Munro wrote:
On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote:
On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote:
+typedef struct FileTag
+{
+ int16 handler; /* SyncRequestHandler value, saving space */
+ int16 forknum; /* ForkNumber, saving space */
+ RelFileNode rnode;
+ BlockNumber segno;
+} FileTag;

Seems odd to me to use BlockNumber for segno.
That is a tradition in md.c code. I had a new typedef SegmentNumber
in all sync.{c,h} stuff in an earlier version, but had trouble
figuring out where to define it...
Thomas, this is why I had defined segment.h with the contents below :-)
+++ b/src/include/storage/segment.h
[...]
+/*
+ * Segment Number:
+ *
+ * Each relation and its forks are divided into segments. This
+ * definition formalizes the definition of the segment number.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
My last iteration, v12, patch had it. See [1] for comments on removal of
segment.h.
[1]: /messages/by-id/20190403214423.GA45392@f01898859afd.ant.amazon.com
--
Shawn Debnath
Amazon Web Services (AWS)
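For what it's worth, a tiny standalone sketch of how that sentinel would
be used, mirroring the InvalidBlockNumber convention (the helper below
is hypothetical, not from any patch in this thread):

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t SegmentNumber;
#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)

/* Hypothetical lookup: first segment marked present, or the sentinel. */
static SegmentNumber
first_present_segment(const bool *present, int nsegs)
{
	for (int i = 0; i < nsegs; i++)
	{
		if (present[i])
			return (SegmentNumber) i;
	}
	return InvalidSegmentNumber;
}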
On Thu, Apr 4, 2019 at 6:41 PM Shawn Debnath <sdn@amazon.com> wrote:
On Thu, Apr 04, 2019 at 05:39:14PM +1300, Thomas Munro wrote:
On Thu, Apr 4, 2019 at 5:36 PM Andres Freund <andres@anarazel.de> wrote:
On 2019-04-03 21:19:45 -0700, Shawn Debnath wrote:
+typedef struct FileTag
+{
+ int16 handler; /* SyncRequestHandler value, saving space */
+ int16 forknum; /* ForkNumber, saving space */
+ RelFileNode rnode;
+ BlockNumber segno;
+} FileTag;

Seems odd to me to use BlockNumber for segno.
That is a tradition in md.c code. I had a new typedef SegmentNumber
in all sync.{c,h} stuff in an earlier version, but had trouble
figuring out where to define it...

Thomas, this is why I had defined segment.h with the contents below :-)
+++ b/src/include/storage/segment.h
[...]
+/*
+ * Segment Number:
+ *
+ * Each relation and its forks are divided into segments. This
+ * definition formalizes the definition of the segment number.
+ */
+typedef uint32 SegmentNumber;
+
+#define InvalidSegmentNumber ((SegmentNumber) 0xFFFFFFFF)
I don't think it's project policy to put a single typedef into its own
header like that, and I'm not sure where else to put it. I have
changed FileTag's segno member to plain old uint32 for now. md.c
continues to use BlockNumber for segment numbers (as it has ever since
commit e0c9301c87 switched over from int), but that's all internal to
md.c and I think we can reasonably leave that sort of improvement to a
later patch.
Pushed. Thanks for all the work on this!
--
Thomas Munro
https://enterprisedb.com
On 2019-Apr-04, Thomas Munro wrote:
I don't think it's project policy to put a single typedef into its own
header like that, and I'm not sure where else to put it.
shrug. Looks fine to me. I suppose if we don't have it anywhere, it's
just because we haven't needed that particular trick yet. Creating a
file with a lone typedef seems better than using uint32 to me.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Apr 5, 2019 at 2:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Apr-04, Thomas Munro wrote:
I don't think it's project policy to put a single typedef into its own
header like that, and I'm not sure where else to put it.

shrug. Looks fine to me. I suppose if we don't have it anywhere, it's
just because we haven't needed that particular trick yet. Creating a
file with a lone typedef seems better than using uint32 to me.
It was commit 9fac5fd7 that gave me that idea.
Ok, here is a patch that adds a one-typedef header and uses
SegmentIndex to replace all cases of BlockNumber and int holding a
segment number (whether as an "index" or a "count").
--
Thomas Munro
https://enterprisedb.com
Attachments:
0001-Introduce-SegmentNumber-typedef-for-relation-segment.patch (application/octet-stream)
From 3cae9d6ed02dabb38f9b2a9b952d5138e0a60c42 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 5 Apr 2019 10:21:54 +1300
Subject: [PATCH] Introduce SegmentNumber typedef for relation segment numbers.
Previously we used BlockNumber for md.c's segment numbers. Define a
separate typename, but keep the same underlying type. Also use it for
a couple of tools under src/bin that know about md.c's file layout
scheme.
Discussion: https://postgr.es/m/20190404130258.GA7320%40alvherre.pgsql
---
src/backend/storage/smgr/md.c | 28 ++++++++++++++--------------
src/bin/pg_checksums/pg_checksums.c | 5 +++--
src/bin/pg_rewind/filemap.c | 7 ++++---
src/include/storage/segment.h | 28 ++++++++++++++++++++++++++++
src/include/storage/smgr.h | 3 ++-
src/include/storage/sync.h | 3 ++-
6 files changed, 53 insertions(+), 21 deletions(-)
create mode 100644 src/include/storage/segment.h
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index ffb3569698f..84fdc314b31 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -81,7 +81,7 @@
typedef struct _MdfdVec
{
File mdfd_vfd; /* fd number in fd.c's pool */
- BlockNumber mdfd_segno; /* segment number, from 0 */
+ SegmentNumber mdfd_segno; /* segment number, from 0 */
} MdfdVec;
static MemoryContext MdCxt; /* context for all MdfdVec objects */
@@ -124,16 +124,16 @@ static MdfdVec *mdopen(SMgrRelation reln, ForkNumber forknum, int behavior);
static void register_dirty_segment(SMgrRelation reln, ForkNumber forknum,
MdfdVec *seg);
static void register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
- BlockNumber segno);
+ SegmentNumber segno);
static void register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
- BlockNumber segno);
+ SegmentNumber segno);
static void _fdvec_resize(SMgrRelation reln,
ForkNumber forknum,
int nseg);
static char *_mdfd_segpath(SMgrRelation reln, ForkNumber forknum,
- BlockNumber segno);
+ SegmentNumber segno);
static MdfdVec *_mdfd_openseg(SMgrRelation reln, ForkNumber forkno,
- BlockNumber segno, int oflags);
+ SegmentNumber segno, int oflags);
static MdfdVec *_mdfd_getseg(SMgrRelation reln, ForkNumber forkno,
BlockNumber blkno, bool skipFsync, int behavior);
static BlockNumber _mdnblocks(SMgrRelation reln, ForkNumber forknum,
@@ -329,7 +329,7 @@ mdunlinkfork(RelFileNodeBackend rnode, ForkNumber forkNum, bool isRedo)
if (ret >= 0)
{
char *segpath = (char *) palloc(strlen(path) + 12);
- BlockNumber segno;
+ SegmentNumber segno;
/*
* Note that because we loop until getting ENOENT, we will correctly
@@ -715,7 +715,7 @@ mdnblocks(SMgrRelation reln, ForkNumber forknum)
{
MdfdVec *v = mdopen(reln, forknum, EXTENSION_FAIL);
BlockNumber nblocks;
- BlockNumber segno = 0;
+ SegmentNumber segno = 0;
/* mdopen has opened the first segment */
Assert(reln->md_num_open_segs[forknum] > 0);
@@ -865,7 +865,7 @@ mdtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
void
mdimmedsync(SMgrRelation reln, ForkNumber forknum)
{
- int segno;
+ SegmentNumber segno;
/*
* NOTE: mdnblocks makes sure we have opened all active segments, so that
@@ -925,7 +925,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
*/
static void
register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
- BlockNumber segno)
+ SegmentNumber segno)
{
FileTag tag;
@@ -942,7 +942,7 @@ register_unlink_segment(RelFileNodeBackend rnode, ForkNumber forknum,
*/
static void
register_forget_request(RelFileNodeBackend rnode, ForkNumber forknum,
- BlockNumber segno)
+ SegmentNumber segno)
{
FileTag tag;
@@ -1043,7 +1043,7 @@ _fdvec_resize(SMgrRelation reln,
* returned string is palloc'd.
*/
static char *
-_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
+_mdfd_segpath(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno)
{
char *path,
*fullpath;
@@ -1066,7 +1066,7 @@ _mdfd_segpath(SMgrRelation reln, ForkNumber forknum, BlockNumber segno)
* and make a MdfdVec object for it. Returns NULL on failure.
*/
static MdfdVec *
-_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, BlockNumber segno,
+_mdfd_openseg(SMgrRelation reln, ForkNumber forknum, SegmentNumber segno,
int oflags)
{
MdfdVec *v;
@@ -1110,8 +1110,8 @@ _mdfd_getseg(SMgrRelation reln, ForkNumber forknum, BlockNumber blkno,
bool skipFsync, int behavior)
{
MdfdVec *v;
- BlockNumber targetseg;
- BlockNumber nextsegno;
+ SegmentNumber targetseg;
+ SegmentNumber nextsegno;
/* some way to handle non-existent segments needs to be specified */
Assert(behavior &
diff --git a/src/bin/pg_checksums/pg_checksums.c b/src/bin/pg_checksums/pg_checksums.c
index bc899826580..26a7988035a 100644
--- a/src/bin/pg_checksums/pg_checksums.c
+++ b/src/bin/pg_checksums/pg_checksums.c
@@ -29,6 +29,7 @@
#include "storage/bufpage.h"
#include "storage/checksum.h"
#include "storage/checksum_impl.h"
+#include "storage/segment.h"
static int64 files = 0;
@@ -167,7 +168,7 @@ skipfile(const char *fn)
}
static void
-scan_file(const char *fn, BlockNumber segmentno)
+scan_file(const char *fn, SegmentNumber segmentno)
{
PGAlignedBlock buf;
PageHeader header = (PageHeader) buf.data;
@@ -310,7 +311,7 @@ scan_directory(const char *basedir, const char *subdir, bool sizeonly)
char fnonly[MAXPGPATH];
char *forkpath,
*segmentpath;
- BlockNumber segmentno = 0;
+ SegmentNumber segmentno = 0;
if (skipfile(de->d_name))
continue;
diff --git a/src/bin/pg_rewind/filemap.c b/src/bin/pg_rewind/filemap.c
index 63d0baee745..f76eee65dd3 100644
--- a/src/bin/pg_rewind/filemap.c
+++ b/src/bin/pg_rewind/filemap.c
@@ -22,12 +22,13 @@
#include "catalog/pg_tablespace_d.h"
#include "fe_utils/logging.h"
#include "storage/fd.h"
+#include "storage/segment.h"
filemap_t *filemap = NULL;
static bool isRelDataFile(const char *path);
static char *datasegpath(RelFileNode rnode, ForkNumber forknum,
- BlockNumber segno);
+ SegmentNumber segno);
static int path_cmp(const void *a, const void *b);
static int final_filemap_cmp(const void *a, const void *b);
static void filemap_list_to_array(filemap_t *map);
@@ -424,7 +425,7 @@ process_block_change(ForkNumber forknum, RelFileNode rnode, BlockNumber blkno)
file_entry_t *key_ptr;
file_entry_t *entry;
BlockNumber blkno_inseg;
- int segno;
+ SegmentNumber segno;
filemap_t *map = filemap;
file_entry_t **e;
@@ -762,7 +763,7 @@ isRelDataFile(const char *path)
* The returned path is palloc'd
*/
static char *
-datasegpath(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+datasegpath(RelFileNode rnode, ForkNumber forknum, SegmentNumber segno)
{
char *path;
char *segpath;
diff --git a/src/include/storage/segment.h b/src/include/storage/segment.h
new file mode 100644
index 00000000000..7aa86d4d964
--- /dev/null
+++ b/src/include/storage/segment.h
@@ -0,0 +1,28 @@
+/*-------------------------------------------------------------------------
+ *
+ * segment.h
+ * POSTGRES disk segment definitions.
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/segment.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef SEGMENT_H
+#define SEGMENT_H
+
+#include "storage/block.h"
+
+/*
+ * We avoid creating very large disk files by cutting relations up into
+ * smaller segment files. Since there are values of RELSEG_SIZE and BLCKSZ
+ * that would require md.c to create more than 2^16 segments for a relation
+ * with MaxBlockNumber blocks, we can't use anything smaller than the size
+ * we use for BlockNumber, so just define one in terms of the other.
+ */
+typedef BlockNumber SegmentNumber;
+
+#endif /* SEGMENT_H */
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 770193e285e..20d91213b99 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -17,6 +17,7 @@
#include "lib/ilist.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
+#include "storage/segment.h"
/*
* smgr.c maintains a table of SMgrRelation objects, which are essentially
@@ -67,7 +68,7 @@ typedef struct SMgrRelationData
* for md.c; per-fork arrays of the number of open segments
* (md_num_open_segs) and the segments themselves (md_seg_fds).
*/
- int md_num_open_segs[MAX_FORKNUM + 1];
+ SegmentNumber md_num_open_segs[MAX_FORKNUM + 1];
struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
/* if unowned, list link in list of all unowned SMgrRelations */
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index 124a49ea984..6759321e645 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -14,6 +14,7 @@
#define SYNC_H
#include "storage/relfilenode.h"
+#include "storage/segment.h"
/*
* Type of sync request. These are used to manage the set of pending
@@ -47,7 +48,7 @@ typedef struct FileTag
int16 handler; /* SyncRequestHandler value, saving space */
int16 forknum; /* ForkNumber, saving space */
RelFileNode rnode;
- uint32 segno;
+ SegmentNumber segno;
} FileTag;
extern void InitSync(void);
--
2.21.0
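To spell out the arithmetic the typedef documents: md.c splits each
relation fork into RELSEG_SIZE-block files, so a block number maps to a
segment number plus an offset within that segment. A self-contained
sketch using stand-in typedefs (RELSEG_SIZE is 131072 in a default
build, i.e. 1 GB segments of 8 kB blocks):

#include <stdint.h>

typedef uint32_t BlockNumber;
typedef BlockNumber SegmentNumber;

#define RELSEG_SIZE 131072 /* default build: 1 GB segments, 8 kB blocks */

/* Which segment file holds a given block. */
static SegmentNumber
segment_containing(BlockNumber blkno)
{
	return blkno / (BlockNumber) RELSEG_SIZE;
}

/* Block offset within that segment file. */
static BlockNumber
block_within_segment(BlockNumber blkno)
{
	return blkno % (BlockNumber) RELSEG_SIZE;
}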
On Fri, Apr 5, 2019 at 10:53 AM Thomas Munro <thomas.munro@gmail.com> wrote:
Ok, here is a patch that adds a one-typedef header and uses
SegmentIndex to replace all cases of BlockNumber and int holding a
segment number (whether as an "index" or a "count").
(sorry, I meant "SegmentNumber", not "SegmentIndex")
--
Thomas Munro
https://enterprisedb.com
On Thu, Apr 4, 2019 at 11:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
Pushed. Thanks for all the work on this!
I managed to break this today while testing with RELSEG_SIZE set to 1
block (= squillions of 8kb files). The problem is incorrect arguments
to _mdfd_getseg(), in code added recently by me. Without the
EXTENSION_DONT_CHECK_SIZE flag, it refuses to open segments following
segments that have been truncated, leading to a checkpointer fsync
panic. It's also passing segnum where a blocknum is wanted. It
should have used exactly the same arguments as in the old code, but
didn't. I will push a fix shortly.
--
Thomas Munro
https://enterprisedb.com
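For context, the pre-refactoring mdsync() called _mdfd_getseg() with the
shape below (as in PostgreSQL 11's md.c), which is presumably what the
fix restores in the new mdsyncfiletag() path; the fragment is
illustrative only and not compilable on its own:

/*
 * Convert the segment number back to a block number, and don't insist
 * that earlier segments still be full: without
 * EXTENSION_DONT_CHECK_SIZE, segments beyond a truncation point can't
 * be opened, and a sync request for them would PANIC the checkpointer.
 */
seg = _mdfd_getseg(reln, forknum,
				   (BlockNumber) segno * (BlockNumber) RELSEG_SIZE,
				   false,
				   EXTENSION_RETURN_NULL | EXTENSION_DONT_CHECK_SIZE);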
On Fri, Apr 05, 2019 at 10:53:53AM +1300, Thomas Munro wrote:
On Fri, Apr 5, 2019 at 2:03 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
On 2019-Apr-04, Thomas Munro wrote:
I don't think it's project policy to put a single typedef into its own
header like that, and I'm not sure where else to put it.

shrug. Looks fine to me. I suppose if we don't have it anywhere, it's
just because we haven't needed that particular trick yet. Creating a
file with a lone typedef seems better than using uint32 to me.

It was commit 9fac5fd7 that gave me that idea.

Ok, here is a patch that adds a one-typedef header and uses
SegmentIndex to replace all cases of BlockNumber and int holding a
segment number (whether as an "index" or a "count").
Looks good to me.
--
Shawn Debnath
Amazon Web Services (AWS)
On 2019-Apr-05, Thomas Munro wrote:
Ok, here is a patch that adds a one-typedef header and uses
SegmentIndex to replace all cases of BlockNumber and int holding a
segment number (whether as an "index" or a "count").
Hmm, I now see (while doing the pg_checksums translation) that this patch
didn't make it. Pity ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services