Handing off SLRU fsyncs to the checkpointer
Hi,
In commit 3eb77eba we made it possible for any subsystem that wants a
file to be flushed as part of the next checkpoint to ask the
checkpointer to do that, as previously only md.c could do.
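(For anyone who hasn't looked at that interface: the calling pattern, as
the attached patch uses it, is roughly the sketch below. A FileTag names
the file, RegisterSyncRequest() hands it to the checkpointer, and sync.c
later calls back into a per-handler sync function at checkpoint time.
This is only a sketch; "segno" stands in for whatever segment number the
caller has at hand, and SYNC_HANDLER_CLOG is one of the handler values
the patch adds.)

    FileTag     tag;

    /* Identify the file by sync handler + segment number only. */
    memset(&tag, 0, sizeof(tag));
    tag.handler = SYNC_HANDLER_CLOG;
    tag.segno = segno;

    /* Ask the checkpointer to fsync it as part of the next checkpoint. */
    if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
    {
        /* Queue full: fall back to calling pg_fsync() ourselves. */
    }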
In the past, foreground CLOG flush stalls were a problem, but then
commit 33aaa139 cranked up the number of buffers, and then commit
5364b357 cranked it right up until the flushes mostly disappeared from
some benchmark workload but not so high that the resulting linear
searches through the buffer array destroyed the gains. I know there
is interest in moving that stuff into regular shared buffers, so it
can be found via the buffer mapping system (and improve as that
improves), written back by the background writer (and improve as that
improves), managed with a proper replacement algorithm (and improve as
that improves), etc etc. That sounds like a great idea to me, but
it's a big project.
In the meantime, one thing we could do is hand off the fsyncs, but I'm
not sure if it's still considered a real problem in the field given
the new parameters.
Anyway, I had a patch for that, which I used while testing commit
3eb77eba. While reading old threads about SLRU today I found that
several people had wished for a thing exactly like that, so I dusted
it off and rebased it.
Attachments:
0001-Use-the-checkpointer-to-fsync-SLRU-files.patch
From ad7ec74ba59d1ef237545e9860ecd6d2ec8649c1 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 3 Apr 2019 22:15:19 +1300
Subject: [PATCH] Use the checkpointer to fsync SLRU files.
Previously, we called fsync() after writing out SLRU pages. Ask
the checkpointer to do it instead, as we do for relation files.
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 24 +++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 172 insertions(+), 44 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f8e7670f8d..b8bafa0678 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
{
ClogCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(ClogCtl, "clog", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- CLogControlLock, "pg_xact", LWTRANCHE_CLOG_BUFFERS);
+ CLogControlLock, "pg_xact", LWTRANCHE_CLOG_BUFFERS,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(ClogCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 630df672cc..f12909e03a 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -494,7 +494,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "commit_timestamp", CommitTsShmemBuffers(), 0,
CommitTsControlLock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFERS);
+ LWTRANCHE_COMMITTS_BUFFERS,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1022,3 +1023,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 50e98caaeb..aab2e8a2e1 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"multixact_offset", NUM_MXACTOFFSET_BUFFERS, 0,
MultiXactOffsetControlLock, "pg_multixact/offsets",
- LWTRANCHE_MXACTOFFSET_BUFFERS);
+ LWTRANCHE_MXACTOFFSET_BUFFERS,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"multixact_member", NUM_MXACTMEMBER_BUFFERS, 0,
MultiXactMemberControlLock, "pg_multixact/members",
- LWTRANCHE_MXACTMEMBER_BUFFERS);
+ LWTRANCHE_MXACTMEMBER_BUFFERS,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3394,3 +3396,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index d5b7a08f73..3302dd8d5e 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -80,6 +80,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -162,7 +174,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -246,7 +259,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -861,23 +874,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1150,21 +1171,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1280,6 +1291,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1323,6 +1335,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1421,3 +1444,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index 25d7d739cf..6793255a67 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransControlLock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFERS, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 9aa2b61600..fdd02a8fe5 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -552,9 +552,8 @@ AsyncShmemInit(void)
*/
AsyncCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(AsyncCtl, "async", NUM_ASYNC_BUFFERS, 0,
- AsyncCtlLock, "pg_notify", LWTRANCHE_ASYNC_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- AsyncCtl->do_fsync = false;
+ AsyncCtlLock, "pg_notify", LWTRANCHE_ASYNC_BUFFERS,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index 654584b77a..c486728880 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ OldSerXidInit(void)
OldSerXidSlruCtl->PagePrecedes = OldSerXidPagePrecedesLogically;
SimpleLruInit(OldSerXidSlruCtl, "oldserxid",
NUM_OLDSERXID_BUFFERS, 0, OldSerXidLock, "pg_serial",
- LWTRANCHE_OLDSERXID_BUFFERS);
- /* Override default assumption that writes should be fsync'd */
- OldSerXidSlruCtl->do_fsync = false;
+ LWTRANCHE_OLDSERXID_BUFFERS, SYNC_HANDLER_NONE);
/*
* Create or attach to the OldSerXidControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 8282a476b4..747eca9579 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,13 +93,32 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
{
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index af4aac08bd..864f9bb871 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 00dbd803e1..ec5df23ff5 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -115,10 +116,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -139,7 +140,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -155,6 +157,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.23.0
On Wed, Feb 12, 2020 at 9:54 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> In commit 3eb77eba we made it possible for any subsystem that wants a
> file to be flushed as part of the next checkpoint to ask the
> checkpointer to do that, as previously only md.c could do.
Hello,
While working on recovery performance, I found my way back to this
idea and rebased the patch.
Problem statement:
Every time we have to write out a page of pg_commit_ts, pg_multixact
or pg_xact due to cache pressure, we immediately call fsync(). This
runs serially in the recovery process, and it's quite bad for
pg_commit_ts, because we need to dump out a page for every ~800
transactions (track_commit_timestamp is not enabled by default). If
we ask the checkpointer to do it, it collapses the 2048 fsync calls
for each SLRU segment into one, and the kernel can write out the data
with larger I/Os, maybe even ahead of time, and update the inode only
once.
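(To spell out where ~800 comes from, assuming the usual commit timestamp
entry of an 8-byte TimestampTz plus a 2-byte RepOriginId, i.e. 10 bytes
per transaction:

    BLCKSZ / SizeOfCommitTsEntry = 8192 / 10 = 819 transactions per pg_commit_ts page.)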
Experiment:
Run crash recovery for 1 million pgbench transactions:
postgres -D pgdata \
-c synchronous_commit=off \
-c track_commit_timestamp=on \
-c max_wal_size=10GB \
-c checkpoint_timeout=60min
# in another shell
pgbench -i -s10 postgres
psql postgres -c checkpoint
pgbench -t1000000 -Mprepared postgres
killall -9 postgres
# save the crashed pgdata dir for repeated experiments
mv pgdata pgdata-save
# now run experiments like this and see how long recovery takes
rm -fr pgdata
cp -r pgdata-save pgdata
postgres -D pgdata
What I see on a system that has around 2.5ms latency for fsync:
master: 16.83 seconds
patched: 4.00 seconds
It's harder to see it without commit timestamps enabled since we only
need to flush a pg_xact page every 32k transactions (and multixacts
are more complicated to test), but you can still see the effect. With
8x more transactions to make it clearer what's going on, I could
measure a speedup of around 6% from this patch, which I suppose scales
up fairly obviously with storage latency (every million transactions =
at least 30 fsync stalls, so you can multiply that by your fsync
latency and work out how much time your recovery process will be
asleep at the wheel instead of applying your records).
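(The "at least 30" is just the pg_xact page count implied by the 32k
figure above:

    1,000,000 transactions / 32,768 xids per pg_xact page ≈ 30.5 pages,

i.e. at least 30 synchronous fsync stalls per million transactions on
master.)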
From a syscall overhead point of view, it's a bit unfortunate that we
open and close SLRU segments every time we write, but it's probably
not really enough to complain about... except for the (small) risk of
an inode dropping out of kernel caches in the time between closing it
and the checkpointer opening it. Hmm.
Attachments:
v2-0001-Ask-the-checkpointer-to-flush-SLRU-files.patch
From 0fe8767316d5a973f43553080c3973759d83038d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH v2] Ask the checkpointer to flush SLRU files.
Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Ask the checkpointer to collapse requests for the same file into a
single system call as part of the next checkpoint, as we do for relation
files.
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 24 +++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 172 insertions(+), 44 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 475f5ed861..27ae2edbdc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9e145f1c36..db1243772f 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -80,6 +80,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -173,7 +185,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +264,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -870,23 +883,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1162,21 +1183,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1295,6 +1306,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1338,6 +1350,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1436,3 +1459,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f33ae407a6..68568f18dc 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 71b7577afc..d0a40b0556 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d24919f76b..579201b10d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..64107fd732 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,13 +93,32 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
{
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..2720284157 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,7 +136,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.20.1
On Tue, Aug 4, 2020 at 6:02 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> ... speedup of around 6% ...
I did some better testing. OS: Linux, storage: consumer SSD. I
repeatedly ran crash recovery on 3.3GB worth of WAL generated with 8M
pgbench transactions. I tested 3 different builds 7 times each and
used "ministat" to compare the recovery times. It told me that:
* Master is around 11% faster than last week before commit c5315f4f
"Cache smgrnblocks() results in recovery."
* This patch gives a similar speedup, bringing the total to around 25%
faster than last week (the time is ~20% less, the WAL processing speed
is ~1.25x).
My test fit in RAM and was all cached. With the patch, the recovery
process used 100% of a single core the whole time, stayed on that core,
and showed low variance; in the other builds it hovered around 90%,
hopped around as it kept getting rescheduled, and showed higher
variance.
Of course, SLRU fsyncs aren't the only I/O stalls in a real system;
among others, there are also reads from faulting in referenced pages
that don't have full page images in the WAL. I'm working on that
separately, but that's a tad more complicated than this stuff.
Added to commit fest.
=== ministat output showing recovery times in seconds ===
x patched.dat
+ master.dat
* lastweek.dat
[ministat distribution plot not reproduced: the x (patched) samples
cluster lowest, the + (master) samples in the middle, and the *
(lastweek) samples highest.]
    N           Min           Max        Median           Avg        Stddev
x   7        38.655        39.406        39.218     39.134857    0.25188849
+   7        42.128        45.068        43.958     43.815286    0.91387758
Difference at 95.0% confidence
        4.68043 +/- 0.780722
        11.9597% +/- 1.99495%
        (Student's t, pooled s = 0.670306)
*   7        47.187        49.404        49.203     48.904286    0.76793483
Difference at 95.0% confidence
        9.76943 +/- 0.665613
        24.9635% +/- 1.70082%
        (Student's t, pooled s = 0.571477)
On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> * Master is around 11% faster than last week before commit c5315f4f
> "Cache smgrnblocks() results in recovery."
> * This patch gives a similar speedup, bringing the total to around 25%
> faster than last week (the time is ~20% less, the WAL processing speed
> is ~1.25x).
Dang, that's pretty nice, especially for the relatively small amount
of code that it seems to require.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Aug 8, 2020 at 2:44 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Aug 5, 2020 at 2:01 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> > * Master is around 11% faster than last week before commit c5315f4f
> > "Cache smgrnblocks() results in recovery."
> > * This patch gives a similar speedup, bringing the total to around 25%
> > faster than last week (the time is ~20% less, the WAL processing speed
> > is ~1.25x).
> Dang, that's pretty nice, especially for the relatively small amount
> of code that it seems to require.
Yeah, the combined effect of these two patches is better than I
expected. To be clear though, I was only measuring the time between
the "redo starts at ..." and "redo done at ..." messages, since I've
been staring at the main recovery code, but there are also some more
fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
that are unaffected. I think it's probably possible to do something
about those too, but that's another topic.
I spotted a small problem: if transaction IDs wrap all the way around
between checkpoints, then you might have cancelled requests for a
removed SLRU segment from the previous epoch, so we'd better uncancel
them if a new request for the same file arrives. That's a one-line fix, done in the
attached. I also adjusted the commit message to be a little clearer
(this work deferment/collapsing scheme works in crash recovery too,
not just when there is a checkpointer process to hand the work to).
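(For the curious, the one-liner lives in RememberSyncRequest() in
sync.c; as the attached patch has it, the condition for (re)initialising
a pending-operations entry becomes:

    /* if new entry, or was previously canceled, initialize it */
    if (!found || entry->canceled)
    {
        entry->cycle_ctr = sync_cycle_ctr;
        entry->canceled = false;
    }

so a sync request arriving for a previously cancelled entry revives it
instead of being silently dropped.)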
Attachments:
0001-Defer-flushing-of-SLRU-files.patch
From 63235aca6b2525716f8f29caf07e0a0e6a26965e Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH] Defer flushing of SLRU files.
Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part of
the next checkpoint, as we do for relation files.
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 28 ++++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 174 insertions(+), 46 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index f3da40ae01..0c5b7a525e 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 903280ae92..b4edbdb4e3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index 475f5ed861..27ae2edbdc 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 9e145f1c36..db1243772f 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -80,6 +80,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -173,7 +185,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +264,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
StrNCpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -870,23 +883,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1162,21 +1183,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1295,6 +1306,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1338,6 +1350,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1436,3 +1459,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index f33ae407a6..68568f18dc 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 71b7577afc..d0a40b0556 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index d24919f76b..579201b10d 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..a59c5193f7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,13 +93,32 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
{
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..2720284157 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,7 +136,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.20.1
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> [patch]
Bitrot, rebased, no changes.
> Yeah, the combined effect of these two patches is better than I
> expected. To be clear though, I was only measuring the time between
> the "redo starts at ..." and "redo done at ..." messages, since I've
> been staring at the main recovery code, but there are also some more
> fsyncs before (SyncDataDirectory()) and after (RemoveOldXlogFiles())
> that are unaffected. I think it's probably possible to do something
> about those too, but that's another topic.
... and of course the end-of-recovery checkpoint; in my tests this
wasn't materially changed since there isn't actually very much CLOG,
it's just that we avoided syncing it a block at a time and getting
rescheduled. FWIW I put a very simple test here:
https://github.com/macdice/redo-bench, YMMV.
Attachments:
v2-0001-Defer-flushing-of-SLRU-files.patchtext/x-patch; charset=US-ASCII; name=v2-0001-Defer-flushing-of-SLRU-files.patchDownload
From 32c1c888816d2800467d1d179678b66d1042d07c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Tue, 4 Aug 2020 17:57:18 +1200
Subject: [PATCH v2] Defer flushing of SLRU files.
Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages, leading to an I/O stall in user backends and recovery.
Collapse requests for the same file into a single system call as part of
the next checkpoint, as we do for relation files.
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 28 ++++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 174 insertions(+), 46 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index dd2f4d5bc7..3eb33aea01 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -692,7 +693,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1034,3 +1036,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index d1dbb43e09..27b87219ed 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -80,6 +80,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -173,7 +185,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +264,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -870,23 +883,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1162,21 +1183,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1295,6 +1306,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1338,6 +1350,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1436,3 +1459,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a087a55542..edbf56e11b 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index 71b7577afc..d0a40b0556 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..577914a2b8 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..a59c5193f7 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,13 +93,32 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
{
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..2720284157 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,7 +136,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.20.1
On Wed, Aug 12, 2020 at 6:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
[patch]
Hi Thomas / hackers,
I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch [2], as both are related to WAL recovery performance, which I'm interested in. This is my first message to this mailing list, so please let me know if I should adjust the testing style or formatting.
With both of those patches applied:
make check -> Passes
make check-world -> Passes
make standbycheck (section "Testing Hot Standby" from docs) -> Passes
There wasn't a single segfault or postmaster crash during the tests.
Review of the patches itself: I'm not qualified to review the PostgreSQL internals.
I've used the redo-bench scripts [1] by Thomas to measure the performance effect (this approach simplifies testing and excludes network jitter): the 1st column is the redo start -> end timing, the 2nd is the redo end -> end-of-checkpointing timing before the DB is opened for reads. I've conducted 2-3 separate tests that show the benefits of those patches depending on the workload:
- Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower / higher latency storage by ~20%
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates by ~20% as well
TEST1: workload profile test as per standard TPC-B pgbench -c8 -j8, with majority of records in WAL stream being Heap/HOT_UPDATE:
xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
72.991, 0.919
73.688, 1.027
72.228, 1.032
xvda, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch
72.271, 0.857
72.717, 0.748
72.705, 0.81
xvda, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
72.49, 0.103
74.069, 0.102
73.368, 0.102
NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch
70.312, 0.22
70.615, 0.201
69.739, 0.194
NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
69.686, 0.101
70.601, 0.102
70.042, 0.101
As Thomas stated, in the standard pgbench workload profile there is visible compactify_tuples()->pg_qsort() overhead on the recovery side. This is where the 2nd patch helps:
NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch and compactify_tuples()->pg_qsort() performance improvement
58.85, 0.296
58.605, 0.269
58.603, 0.277
NVMe, same master with, v2-0001-Defer-flushing-of-SLRU-files.patch and compactify_tuples()->pg_qsort() performance improvement, fsync=off
58.618, 0.103
57.693, 0.101
58.779, 0.111
In this last case the top of the profile is still related to the sorting, but as I understand it in a much more optimal way:
26.68% postgres postgres [.] qsort_itemoff
---qsort_itemoff
|--14.17%--qsort_itemoff
| |--10.96%--compactify_tuples
| | PageRepairFragmentation
| | heap2_redo
| | StartupXLOG
| --3.21%--qsort_itemoff
| --3.10%--compactify_tuples
| PageRepairFragmentation
| heap2_redo
| StartupXLOG
--12.51%--compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG
8.38% postgres libc-2.17.so [.] __memmove_ssse3_back
---__memmove_ssse3_back
compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG
6.51% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--3.62%--smgropen
| |--2.17%--XLogReadBufferExtended
| | --1.76%--XLogReadBufferForRedoExtended
| | --0.93%--heap_xlog_update
| --1.45%--ReadBufferWithoutRelcache
| XLogReadBufferExtended
| --1.34%--XLogReadBufferForRedoExtended
| --0.72%--heap_xlog_update
--2.69%--BufTableLookup
ReadBuffer_common
ReadBufferWithoutRelcache
XLogReadBufferExtended
--2.48%--XLogReadBufferForRedoExtended
|--1.34%--heap2_redo
| StartupXLOG
--0.83%--heap_xlog_update
To sum up, the HOT-update-like workload profile tends to be CPU bound on the single-process recovery side. Even slow storage (like xvda) was not the bottleneck here, as the data was already hot in the VFS cache.
TEST2: The second suite of tests used an append-only workload profile (the same number of transactions, using pgbench -c8 -j8 -f insert_only.sql -n -t 1000000), however with a simpler structure:
CREATE TABLE t (c1 uuid NOT NULL, c2 uuid NOT NULL, c3 integer NOT NULL, c4 integer NOT NULL);
CREATE INDEX i1 ON t USING btree (c1);
CREATE INDEX i2 ON t USING btree (c2);
CREATE INDEX i3 ON t USING btree (c3);
CREATE EXTENSION "uuid-ossp"
and a customized script for pgbench, just this: BEGIN; INSERT INTO t (c1,c2,c3,c4) values ( uuid_generate_v4(), uuid_generate_v4(), round(5000*random()), 500); END;
The majority of WAL records are Btree/INSERT_LEAF (60%); the results are as follows:
xvda, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
120.732, 5.275
120.608, 5.902
120.685, 5.872
xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
99.061, 7.071
99.472, 6.386
98.036, 5.994
xvda, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
102.838, 0.136
102.711, 0.099
103.093, 0.0970001
NVMe, master @ a3c66de6c5e1ee9dd41ce1454496568622fb7712 (24/08/2020) - baseline:
96.46, 0.405
96.121, 0.405
95.951, 0.402
NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch
94.124, 0.387
96.61, 0.416
94.624, 0.451
NVMe, same master, with v2-0001-Defer-flushing-of-SLRU-files.patch, fsync=off
95.401, 0.0969999
95.028, 0.099
94.632, 0.0970001
So apparently the v2-0001-Defer-flushing-of-SLRU-files helps in my case on higher latency storage.
The append-only bottleneck appears to be limited by syscalls/s due to the small block size, even with everything in the FS cache (but not in shared buffers; compare with TEST1, where there was no such bottleneck at all):
29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.98%--copyin
[..]
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| --17.57%--btree_xlog_insert
| btree_redo
| StartupXLOG
|
--11.64%--copyout
[..]
__pread_nocancel
--11.44%--FileRead
mdread
ReadBuffer_common
ReadBufferWithoutRelcache
XLogReadBufferExtended
XLogReadBufferForRedoExtended
--11.34%--btree_xlog_insert
btree_redo
StartupXLOG
5.62% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--1.74%--smgropen
[..]
The number of syscalls/s topped out at ~100k/s at 8kB for reads and writes respectively (it's logical I/O, as everything fits in the VFS cache; nearly no physical I/O). It was also visible during the test that the startup/recovering process spent 60% of its time in %sys under these conditions. As there was no sorting visible in the profiler, I did not test this workload with the compactify_tuples()->pg_qsort() performance improvement, although from basic runs it appears it did not introduce any degradation.
TEST2b: Quite frankly, at first glance I did not understand why btree_xlog_insert()->ReadBuffer_common() would need FlushBuffer() and therefore writes, until I bumped shared_buffers from 128MB to 24GB: it was evicting dirty buffers, which caused those pwrite()s - it's not evident as a direct call from ReadBuffer*() in the code (inlining? any idea how to monitor the cycling of dirty buffers during recovery on a standby when there is no bgwriter yet?). Still, I wanted to eliminate storage and the VFS cache as the bottleneck. This led me to results like the below (with fsync=on, also the defer patch, shared_buffers=24GB to eliminate VFS cache lookups):
27.341, 0.858
27.26, 0.869
26.959, 0.86
Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious, the next set of append-only bottlenecks looks like this:
14.69% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--9.80%--BufTableLookup
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| |--7.76%--btree_xlog_insert
| | btree_redo
| | StartupXLOG
| --1.63%--heap_xlog_insert
--4.90%--smgropen
|--2.86%--ReadBufferWithoutRelcache
| XLogReadBufferExtended
| |--2.04%--XLogReadBufferForRedoExtended
| | |--1.22%--btree_xlog_insert
| | | btree_redo
| | | StartupXLOG
| | --0.82%--heap_xlog_insert
| --0.82%--XLogRecordPageWithFreeSpace
| heap_xlog_insert
--2.04%--XLogReadBufferExtended
XLogReadBufferForRedoExtended
--1.63%--btree_xlog_insert
btree_redo
StartupXLOG
7.76% postgres libc-2.17.so [.] __memmove_ssse3_back
---__memmove_ssse3_back
PageAddItemExtended
btree_xlog_insert
btree_redo
StartupXLOG
Still, the result seems nice: 10^6 transactions / 27s = ~34k TPS on the recovery side (~3000MB of WAL / ~27s = ~111MB/s without FPW), assuming no I/O overheads. For the full picture, compare this with the rate at which the primary generated that data with INSERTs, without any special tuning:
- ~29k TPS of INSERTs-only/COMMITs with pgbench -c8
- ~49k TPS of INSERTs-only/COMMITs with pgbench -c16
so the (single-process) WAL recovery code in 14master seems to achieve roughly 70% of the performance of a fairly low-end 16vCPU primary under the above append-only conditions.
Specs: Linux 4.14 kernel, ext4 filesystems (data=ordered,noatime), 1s8c16t Xeon CPU E5-2686 v4 @ 2.30GHz, 128GB RAM, gcc 7.2.1, CFLAGS set by ./configure to "-O2", test on default/tiny shared_buffers until last test.
xvda AKA slow storage: root file system, single thread tests:
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==> 157 MB/s
dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 17.2 MB/s
NVMe: striped VG consisting of 2x NVMe devices with much lower latency
dd if=/dev/zero of=test bs=1M count=1000 oflag=direct ==> 1.9 GB/s or maybe even more
dd if=/dev/zero of=test bs=8k count=10000 oflag=direct => 141 MB/s
-Jakub Wartak.
[1]: https://github.com/macdice/redo-bench/
[2]: https://commitfest.postgresql.org/29/2687/
On Tue, Aug 25, 2020 at 9:16 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
I just wanted to help test this patch (defer SLRU fsyncs during recovery) and also the faster compactify_tuples() patch [2], as both are related to WAL recovery performance, which I'm interested in. This is my first message to this mailing list, so please let me know if I should adjust the testing style or formatting.
Hi Jakub,
Thanks very much for these results!
- Handing SLRU sync work over to the checkpointer: in my testing it accelerates WAL recovery performance on slower / higher latency storage by ~20%
Wow. Those fsyncs must have had fairly high latency (presumably due
to queuing behind other write back activity).
- Faster sort in compactify_tuples(): in my testing it accelerates WAL recovery performance for HOT updates also by ~20%
Nice.
In this last case the top of the profile is still related to the sorting, but as I understand it in a much more optimal way:
26.68% postgres postgres [.] qsort_itemoff
---qsort_itemoff
|--14.17%--qsort_itemoff
| |--10.96%--compactify_tuples
| | PageRepairFragmentation
| | heap2_redo
| | StartupXLOG
| --3.21%--qsort_itemoff
| --3.10%--compactify_tuples
| PageRepairFragmentation
| heap2_redo
| StartupXLOG
--12.51%--compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG
I wonder if there is something higher level that could be done to
reduce the amount of compaction work required in the first place, but
in the meantime I'm very happy if we can improve the situation so much
with such a microscopic improvement that might eventually benefit
other sorting stuff...
8.38% postgres libc-2.17.so [.] __memmove_ssse3_back
---__memmove_ssse3_back
compactify_tuples
PageRepairFragmentation
heap2_redo
Hmm, I wonder if this bit could go a teensy bit faster by moving as many
adjacent tuples as you can in one go rather than moving them one at a
time...
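Something like the following, purely as a sketch of the coalescing idea and not the real compactify_tuples() code: walk the items in the order they'd be processed anyway, and merge neighbours whose source bytes are contiguous and whose displacement is identical, so a whole run becomes a single memmove().

#include <stdint.h>
#include <string.h>

typedef struct
{
	uint16_t	old_off;		/* current offset of tuple data in the page */
	uint16_t	new_off;		/* offset it should move to */
	uint16_t	len;			/* tuple length */
} ItemMove;

/*
 * Sketch only: "items" is assumed to already be sorted in the order the
 * real code would process them.  Instead of one memmove() per tuple,
 * coalesce neighbours whose source regions are contiguous and which
 * shift by the same amount, then move each run with a single memmove().
 */
static void
move_coalesced(char *page, ItemMove *items, int nitems)
{
	int			i = 0;

	while (i < nitems)
	{
		int			j = i;
		long		delta = (long) items[i].new_off - items[i].old_off;

		while (j + 1 < nitems &&
			   items[j + 1].old_off == items[j].old_off + items[j].len &&
			   (long) items[j + 1].new_off - items[j + 1].old_off == delta)
			j++;

		memmove(page + items[i].new_off,
				page + items[i].old_off,
				(size_t) (items[j].old_off + items[j].len - items[i].old_off));
		i = j + 1;
	}
}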
The append-only bottleneck appears to be limited by syscalls/s due to small block size even with everything in FS cache (but not in shared buffers, please compare with TEST1 as there was no such bottleneck at all):
29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.98%--copyin
[..]
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| --17.57%--btree_xlog_insert
To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery. I'm going to look
into that.
The other thing is of course the checkpointer process, and our
end-of-recovery checkpoint. I was going to suggest it should be
optional and not done by the recovery process itself, which is why
some earlier numbers I shared didn't include the end-of-recovery
checkpoint, but then I realised it complicated the numbers for this
little patch and, anyway, it'd be a good idea to open that can of
worms separately...
| btree_redo
| StartupXLOG
|
--11.64%--copyout
[..]
__pread_nocancel
--11.44%--FileRead
mdread
ReadBuffer_common
ReadBufferWithoutRelcache
XLogReadBufferExtended
XLogReadBufferForRedoExtended
For these reads, the solution should be WAL prefetching, but the patch
I shared for that (and will be updating soon) is just one piece of the
puzzle, and as it stands it actually *increases* the number of
syscalls by adding some posix_fadvise() calls, so ... erm, for an
all-in-kernel-cache-already workload like what you profiled there it
can only make things worse on that front. But... when combined with
Andres's work-in-progress AIO stuff, a whole bunch of reads can be
submitted with a single system call ahead of time and then the results
are delivered directly into our buffer pool by kernel threads or
hardware DMA, so we'll not only avoid going off CPU during recovery
but we'll also reduce the system call count.
Turning on/off the defer SLRU patch and/or fsync doesn't seem to make any difference, so if anyone is curious the next sets of append-only bottlenecks is like below:
14.69% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--9.80%--BufTableLookup
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
Hypothesis: Your 24GB buffer pool requires somewhere near 70MB of
buffer mapping table (huh, pg_shmem_allocations doesn't show that
correctly), so it doesn't fit into any level of your memory cache
hierarchy and it's super random access, so every buffer lookup is
costing you a ~60-100ns memory stall. Maybe?
If that's the reason for this showing up in your profile, I think I
could probably add a little cache line prefetch phase to the WAL
prefetch patch to fix it. I've actually tried prefetching the buffer
mapping cache lines before, without success, but never in recovery.
I'll make a note to look into that.
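To spell out what I mean by a prefetch phase, here's a toy sketch with a plain open-addressing table standing in for the real buffer mapping hash (none of this is PostgreSQL code, just the shape of the idea):

#include <stddef.h>
#include <stdint.h>

typedef struct
{
	uint64_t	key;
	int			value;
} Bucket;

/*
 * Toy sketch: "table" stands in for the buffer mapping hash table and
 * "keys" for a batch of upcoming buffer tags (already hashed).  Phase 1
 * issues prefetches for all the bucket cache lines; phase 2 does the
 * real probes, by which time those lines are hopefully already on their
 * way into the cache, so the stalls overlap instead of being paid one
 * at a time.
 */
static void
batched_lookup(const Bucket *table, size_t nbuckets,
			   const uint64_t *keys, int nkeys, int *results)
{
	for (int i = 0; i < nkeys; i++)
		__builtin_prefetch(&table[keys[i] % nbuckets], 0, 1);

	for (int i = 0; i < nkeys; i++)
	{
		const Bucket *b = &table[keys[i] % nbuckets];

		results[i] = (b->key == keys[i]) ? b->value : -1;
	}
}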
Hi,
On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
--12.51%--compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG
I wonder if there is something higher level that could be done to
reduce the amount of compaction work required in the first place, but
in the meantime I'm very happy if we can improve the situation so much
with such a microscopic improvement that might eventually benefit
other sorting stuff...
Another approach could be to not perform any sorting during recovery,
instead including enough information in the WAL record to avoid doing a
full blown PageRepairFragmentation during recovery.
Greetings,
Andres Freund
On 2020-Aug-25, Andres Freund wrote:
Hi,
On 2020-08-26 15:58:14 +1200, Thomas Munro wrote:
--12.51%--compactify_tuples
PageRepairFragmentation
heap2_redo
StartupXLOG
I wonder if there is something higher level that could be done to
reduce the amount of compaction work required in the first place, but
in the meantime I'm very happy if we can improve the situation so much
with such a microscopic improvement that might eventually benefit
other sorting stuff...
Another approach could be to not perform any sorting during recovery,
instead including enough information in the WAL record to avoid doing a
full blown PageRepairFragmentation during recovery.
Hmm, including the sorted ItemId array in the WAL record might make
sense to alleviate the qsort work ...
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2020-Aug-25, Jakub Wartak wrote:
Turning on/off the defer SLRU patch and/or fsync doesn't seem to make
any difference, so if anyone is curious the next sets of append-only
bottlenecks is like below:
14.69% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--9.80%--BufTableLookup
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| |--7.76%--btree_xlog_insert
| | btree_redo
| | StartupXLOG
| --1.63%--heap_xlog_insert
--4.90%--smgropen
|--2.86%--ReadBufferWithoutRelcache
Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it. BufTableLookup was also there but only 1.51%. So I'm not so
sure now that that'll pay off as clearly as I had hoped.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Aug 27, 2020 at 6:15 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
--4.90%--smgropen
|--2.86%--ReadBufferWithoutRelcache
Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it. BufTableLookup was also there but only 1.51%. So I'm not so
sure now that that'll pay off as clearly as I had hoped.
Right, my hypothesis requires an uncacheably large buffer mapping
table, and I think smgropen() needs a different explanation because
it's not expected to be as large or as random, at least not with a
pgbench workload. I think the reasons for a profile with a smgropen()
showing up so high, and in particular higher than BufTableLookup(),
must be:
1. We call smgropen() twice for every call to BufTableLookup(). Once
in XLogReadBufferExtended(), and then again in
ReadBufferWithoutRelcache().
2. We also call it for every block forced out of the buffer pool, and
in recovery that has to be done by the recovery loop.
3. We also call it for every block in the buffer pool during the
end-of-recovery checkpoint.
Not sure but the last two might perform worse due to proximity to
interleaving pwrite() system calls (just a thought, not investigated).
In any case, I'm going to propose we move those things out of the
recovery loop, in a new thread.
Hi Thomas / hackers,
The append-only bottleneck appears to be limited by syscalls/s due to small block size even with everything in FS cache (but not in shared buffers, please compare with TEST1 as there was no such bottleneck at all):
29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.98%--copyin
[..]
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| --17.57%--btree_xlog_insert
To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery. I'm going to look
into that.
Sounds awesome. Also, as this thread is starting to drift away from the SLRU fsync topic - maybe we should change the subject? However, to add some data on the separate bgwriter: on 14master (already with lseek() caching, SLRU fsyncs out of the way, and better sorting), I've measured the same configuration as last time, with the same append-only WAL workload on NVMe, and compared various shared_buffers settings (with the buffer descriptor sizing from pg_shmem_allocations, which as you stated is reported incorrectly(?), included only for reference just in case):
shared_buffers=128MB buffers_desc=1024kB 96.778, 0.438 [a]
shared_buffers=256MB buffers_desc=2048kB 62.755, 0.577 [a]
shared_buffers=512MB buffers_desc=4096kB 33.167, 0.62 [a]
shared_buffers=1GB buffers_desc=8192kB 27.303, 0.929
shared_buffers=4GB buffers_desc=32MB 27.185, 1.166
shared_buffers=8GB buffers_desc=64MB 27.649, 1.088
shared_buffers=16GB buffers_desc=128MB 27.584, 1.201 [b]
shared_buffers=32GB buffers_desc=256MB 32.314, 1.171 [b]
shared_buffers=48GB buffers_desc=384 MB 31.95, 1.217
shared_buffers=64GB buffers_desc=512 MB 31.276, 1.349
shared_buffers=72GB buffers_desc=576 MB 31.925, 1.284
shared_buffers=80GB buffers_desc=640 MB 31.809, 1.413
The amount of WAL to be replayed was ~2.8GB. To me it looks like:
a) flushing dirty buffers from StartupXLOG might be a real problem, but please read on.
b) there is a very low impact from the L2/L3 hypothesis you mention (?) - it's not that big, and it doesn't degrade linearly as one would expect (??). The L1d/L1i/L2/L3 cache sizes on this machine are 32K/32K/256K/46080K. Maybe the DB size is too small.
I've double-checked that in condition [a] (shared_buffers=128MB) there were a lot of FlushBuffer() invocations per second (perf stat -e probe_postgres:FlushBuffer -I 1000), e.g.:
# time counts unit events
1.000485217 79,494 probe_postgres:FlushBuffer
2.000861366 75,305 probe_postgres:FlushBuffer
3.001263163 79,662 probe_postgres:FlushBuffer
4.001650795 80,317 probe_postgres:FlushBuffer
5.002033273 79,951 probe_postgres:FlushBuffer
6.002418646 79,993 probe_postgres:FlushBuffer
while at 1GB shared_buffers it sits near zero all the time. So something like a 3x speed-up is possible if StartupXLOG didn't have to care about dirty buffers in its critical path (bgwriter would instead, as you say?), at least in append-only scenarios. But... I've checked some real systems (even older versions of PostgreSQL not doing that much replication), and indeed 8-12k FlushBuffer()s per second do happen there, with pretty huge shared buffers (>100GB, 0.5TB RAM) - but those are *system-wide* numbers, and the work is really performed by bgwriter, not by startup/recovering as in this redo-bench case. So it appears this isn't an optimization for the hot standby case, where the DB is open for reads and replicating, but for the DB-closed startup recovery/restart/disaster/PITR scenario.
As for the 24GB shared_buffers scenario, where dirty buffers are not a problem at all, the top of the profile (output trimmed) is again as expected:
13.41% postgres postgres [.] hash_search_with_hash_value
|--8.31%--BufTableLookup <- ReadBuffer_common <- ReadBufferWithoutRelcache
--5.11%--smgropen
|--2.77%--XLogReadBufferExtended
--2.34%--ReadBufferWithoutRelcache
7.88% postgres postgres [.] MarkBufferDirty
I've tried to get the cache miss ratio via PMCs; apparently on EC2 they are reported as not supported or as zeros (even on bigger instances). Interestingly, however, the workload has an IPC of 1.40 (instruction bound), which is strange to me, as I would expect BufTableLookup() to be heavily memory bound (?). Maybe I'll try on some different hardware one day.
__pread_nocancel
--11.44%--FileRead
mdread
ReadBuffer_common
ReadBufferWithoutRelcache
XLogReadBufferExtended
XLogReadBufferForRedoExtended
For these reads, the solution should be WAL prefetching, (..) But... when combined with Andres's work-in-progress AIO stuff (..)
Yes, I've heard a thing or two about those :) I hope I'll be able to deliver some measurements of those two together (AIO + WAL prefetch) sooner or later.
-Jakub Wartak.
Hi Alvaro, Thomas, hackers
14.69% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--9.80%--BufTableLookup
[..]
--4.90%--smgropen
|--2.86%--ReadBufferWithoutRelcache
Looking at an earlier report of this problem I was thinking whether it'd
make sense to replace SMgrRelationHash with a simplehash table; I have a
half-written patch for that, but I haven't completed that work.
However, in the older profile things were looking different, as
hash_search_with_hash_value was taking 35.25%, and smgropen was 33.74%
of it. BufTableLookup was also there but only 1.51%. So I'm not so
sure now that that'll pay off as clearly as I had hoped.
Yes, quite frankly my expectation was to see the hash_search_with_hash_value() <- smgropen() path come out first, but in the simplified redo-bench script that's not the case. The original scenario was much more complex, with plenty of differences (in no particular order: TB-sized DB vs ~500GB RAM -> thousands of forks, multiple tables, huge btrees, multiple INSERTs with plenty of data in VALUES() thrown in as one commit, real primary->hot-standby replication [not a closed DB in recovery], sorted rather than random UUIDs). I'm going to try to nail down these differences and maybe I'll manage to produce a more realistic pgbench reproducer (this may take some time though).
-Jakub Wartak.
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
I've tried to get cache misses ratio via PMCs, apparently on EC2 they are (even on bigger) reporting as not-supported or zeros.
I heard some of the counters are only allowed on their dedicated instance types.
However interestingly the workload has IPC of 1.40 (instruction bound) which to me is strange as I would expect BufTableLookup() to be actually heavy memory bound (?) Maybe I'll try on some different hardware one day.
Hmm, OK now you've made me go and read a bunch of Brendan Gregg bloggs
and try some experiments of my own to get a feel for this number and
what it might be telling us about the cache miss counters you can't
see. Since I know how to generate arbitrary cache miss workloads for
quick experiments using hash joins of different sizes, I tried that
and noticed that when LLC misses were at 76% (bad), IPC was at 1.69
which is still higher than what you're seeing. When the hash table
was much smaller and LLC misses were down to 15% (much better), IPC
was at 2.83. I know Gregg said [1] "An IPC < 1.0 likely means memory
bound, and an IPC > 1.0 likely means instruction bound", but that's
not what I'm seeing here, and in his comments section that is
disputed. So I'm not sure your IPC of 1.40 is evidence against the
hypothesis on its own.
[1]: http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html
On Thu, Aug 27, 2020 at 8:48 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
29.62% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.98%--copyin
[..]
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| ReadBufferWithoutRelcache
| XLogReadBufferExtended
| XLogReadBufferForRedoExtended
| --17.57%--btree_xlog_insert
To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery. I'm going to look
into that.
Sounds awesome.
I wrote a quick and dirty experimental patch to try that. I can't see
any benefit from it on pgbench with default shared buffers, but maybe
it would do better with your append test due to locality, especially
if you can figure out how to tune bgwriter to pace itself optimally.
https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery
Hi Thomas, hackers,
To move these writes out of recovery's way, we should probably just
run the bgwriter process during crash recovery. I'm going to look
into that.
Sounds awesome.
I wrote a quick and dirty experimental patch to try that. I can't see
any benefit from it on pgbench with default shared buffers, but maybe
it would do better with your append test due to locality, especially
if you can figure out how to tune bgwriter to pace itself optimally.
https://github.com/macdice/postgres/tree/bgwriter-in-crash-recovery
OK, so I've quickly tested those two PoC patches together, under the conditions below:
- a similar append-only workload by pgbench (to eliminate other already-known WAL bottlenecks, e.g. sorting),
- 4.3GB of WAL to be applied (mostly Btree/INSERT_LEAF)
- on same system as last time (ext4 on NVMe, 1s8c16, 4.14 kernel)
- 14master already with the SLRU-fsync-to-checkpointer and pg_qsort patches applied
TEST bgwriterPOC1:
- in severe dirty memory conditions (artificially simulated via small s_b here) --> so for workloads with very high FlushBuffer activity in StartupXLOG
- with fsync=off/fpw=off by default and on NVMe (e.g. scenario: I want to perform some PITR as fast as I can to see how the production data looked in the past, before some user deleted some data)
baseline s_b@128MB: 140.404, 0.123 (2nd number small as there is only a small region to checkpoint)
22.49% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--14.72%--copyin
| __pwrite_nocancel
| FileWrite
| mdwrite
| FlushBuffer
| ReadBuffer_common
| --14.52%--btree_xlog_insert
--7.77%--copyout
__pread_nocancel
--7.57%--FileRead
mdread
ReadBuffer_common
6.13% postgres [kernel.kallsyms] [k] do_syscall_64
|--1.64%--__pwrite_nocancel
--1.23%--__pread_nocancel
3.68% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--1.02%--smgropen
After applying:
patch -p1 < ../0001-Run-checkpointer-and-bgworker-in-crash-recovery.patch
patch -p1 < ../0002-Optionally-don-t-wait-for-end-of-recovery-checkpoint.patch
0001+0002 s_b@128MB: similar result to above
0001+0002 s_b@128MB: 108.871, 0.114 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 1000
0001+0002 s_b@128MB: 85.392, 0.103 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 50000 #~390MB max?
18.40% postgres [kernel.kallsyms] [k] copy_user_enhanced_fast_string
---copy_user_enhanced_fast_string
|--17.79%--copyout
| __pread_nocancel
| |--16.56%--FileRead
| | mdread
| | ReadBuffer_common
--0.61%--copyin // WOW
__pwrite_nocancel
FileWrite
mdwrite
FlushBuffer
ReadBuffer_common
9.20% postgres postgres [.] hash_search_with_hash_value
---hash_search_with_hash_value
|--4.70%--smgropen
of course there is another WOW moment during recovery ("61.9%")
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 120935 0.9 0.0 866052 3824 ? Ss 09:47 0:00 postgres: checkpointer
postgres 120936 61.9 0.0 865796 3824 ? Rs 09:47 0:22 postgres: background writer
postgres 120937 97.4 0.0 865940 5228 ? Rs 09:47 0:36 postgres: startup recovering 000000010000000000000089
a speedup of 1.647x when dirty memory is in the way. When it's not:
baseline s_b@24000MB: 39.199, 1.448 (2x patches off)
0001+0002 s_b@24000MB: 39.383, 1.442 , bgwriter_delay = 10ms/bgwriter_lru_maxpages = 50000 #~390MB/s max, yay
there's no regression. I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages should maybe be called standby_bgwriter_lru_maxpages in this scenario, or, even better, there shouldn't be a maximum at all during the closed-DB recovery scenario.
TEST bgwriterPOC2a: to showcase the 2nd patch, which opens the DB for read-write users before the final checkpoint after redo recovery finishes. The DBA may make that decision via the parameter end_of_recovery_checkpoint_wait=off.
- on slow storage (xvda, fsync=on) and even with high memory:
s_b@24000MB: 39.043, 15.639 -- even with WAL recovery being 100% CPU bound (mostly on hash_search_with_hash_value() for buffers and __memmove_ssse3_back), it took an additional 15s to perform the checkpoint before the DB was open for users (it had to write 269462 buffers =~ 2GB at =~ 140MB/s, which is close to the xvda device speed); the complete output in 14master looks similar to this:
1598609928.620 startup 22543 LOG: redo done at 1/12201C88
1598609928.624 checkpointer 22541 LOG: checkpoint starting: end-of-recovery immediate wait
1598609944.908 checkpointer 22541 LOG: checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.145 s, sync=0.138 s, total=16.285 s; sync files=11, longest=0.133 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
1598609944.912 postmaster 22538 LOG: database system is ready to accept connections
s_b@24000MB: 39.96, 0, with end_of_recovery_checkpoint_wait = off, the DB is open 15s earlier
1598610331.556 startup 29499 LOG: redo done at 1/12201C88
1598610331.559 checkpointer 29497 LOG: checkpoint starting: immediate force
1598610331.562 postmaster 29473 LOG: database system is ready to accept connections
1598610347.202 checkpointer 29497 LOG: checkpoint complete: wrote 269462 buffers (8.6%); 0 WAL file(s) added, 0 removed, 273 recycled; write=15.092 s, sync=0.149 s, total=15.643 s; sync files=12, longest=0.142 s, average=0.012 s; distance=4468855 kB, estimate=4468855 kB
I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time, and this 0002 patch bypasses that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage snapshots / storage replication, maybe with DWH-after-crash too).
TEST bgwriterPOC2b: FYI, I was also testing the hot_standby code path -- to see whether it would reduce the time to start / open a fresh standby for read-only queries, but this parameter doesn't seem to influence that in my tests. As I've learned, it's apparently much more complex to reproduce what I'm after, and it involves a lot of reading about LogStandbySnapshot() / standby recovery points on my side.
Now, back to the smgropen() hash_search_with_hash_value() reproducer...
-Jakub Wartak.
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
postgres 120935 0.9 0.0 866052 3824 ? Ss 09:47 0:00 postgres: checkpointer
postgres 120936 61.9 0.0 865796 3824 ? Rs 09:47 0:22 postgres: background writer
postgres 120937 97.4 0.0 865940 5228 ? Rs 09:47 0:36 postgres: startup recovering 000000010000000000000089
a speedup of 1.647x
Thanks for testing! That's better than I expected. I guess it wasn't
quite so good with default bgwriter settings.
I have only one comment about those 2 WIP patches: bgwriter_lru_maxpages should maybe be called standby_bgwriter_lru_maxpages in this scenario, or, even better, there shouldn't be a maximum at all during the closed-DB recovery scenario.
I wish bgwriter could auto-tune itself better, so we wouldn't need to
contemplate adding more settings.
As for the second patch ("Optionally, don't wait for end-of-recovery
checkpoint."), that also looked quite useful in your test scenario:
with end_of_recovery_checkpoint_wait = off, the DB is open 15s earlier
I suppose a checkpoint for large shared_buffers (hundreds of GB) might take a lot of time, and this 0002 patch bypasses that. I would find it quite useful in some scenarios (e.g. testing backups, PITR recoveries, opening a DB from storage snapshots / storage replication, maybe with DWH-after-crash too).
I suppose a third option that you might want is no checkpoint at all
(meaning leave it to regular checkpoint scheduling), like fast
promotion. One thing missing from the patch is that we probably need
to log an end-of-recovery *record*, like fast promotion does. I'm a
little fuzzy on the timeline stuff. I wonder if any recovery experts
would like to weigh in on theoretical problems that might be lurking
here...
On Sat, Aug 29, 2020 at 12:43 AM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
... %CPU ... COMMAND
... 97.4 ... postgres: startup recovering 000000010000000000000089
So, what else is pushing this thing off CPU, anyway? For one thing, I
guess it might be stalling while reading the WAL itself, because (1)
we only read it 8KB at a time, relying on kernel read-ahead, which
typically defaults to 128KB I/Os unless you cranked it up, but for
example we know that's not enough to saturate a sequential scan on
NVME system, so maybe it hurts here too (2) we keep having to switch
segment files every 16MB. Increasing WAL segment size and kernel
readahead size presumably help with that, if indeed it is a problem,
but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
future segment every time we cross a boundary, and also maybe increase
the size of our reads.
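Just to sketch the hint idea concretely: something as small as the following, done whenever we cross into a new segment, might be enough to get the kernel reading ahead of us (the path handling here is a stand-in, not the real WAL file naming code):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Rough sketch only: when we finish reading WAL segment "segno", tell
 * the kernel we'll soon want the whole of segment segno + 1.  The path
 * format is a placeholder for the real WAL file naming scheme.
 */
static void
hint_next_wal_segment(const char *waldir, unsigned long segno, size_t segsize)
{
	char		path[1024];
	int			fd;

	snprintf(path, sizeof(path), "%s/%024lX", waldir, segno + 1);
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return;					/* next segment may not exist yet */
#ifdef POSIX_FADV_WILLNEED
	(void) posix_fadvise(fd, 0, (off_t) segsize, POSIX_FADV_WILLNEED);
#endif
	close(fd);
}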
Hi Thomas, hackers,
... %CPU ... COMMAND
... 97.4 ... postgres: startup recovering 000000010000000000000089
So, what else is pushing this thing off CPU, anyway? For one thing, I
guess it might be stalling while reading the WAL itself, because (1)
we only read it 8KB at a time, relying on kernel read-ahead, which
typically defaults to 128KB I/Os unless you cranked it up, but for
example we know that's not enough to saturate a sequential scan on
NVME system, so maybe it hurts here too (2) we keep having to switch
segment files every 16MB. Increasing WAL segment size and kernel
readahead size presumably help with that, if indeed it is a problem,
but we could also experiment with a big POSIX_FADV_WILLNEED hint for a
future segment every time we cross a boundary, and also maybe increase
the size of our reads.
All of the above (1, 2) would make sense, and the effects IMHO are partially achievable via ./configure compile options, but from previous correspondence [1] (please see the profile after "0001+0002 s_b@128MB: 85.392") it looked like in this particular workload it was not WAL reading, but reading random DB blocks into shared buffers: in that case I suppose it was the price of too many syscalls to the OS/VFS cache itself, as the DB was small and fully cached there - so problem (3): copy_user_enhanced_fast_string <- 17.79%--copyout (!) <- __pread_nocancel <- 16.56%--FileRead / mdread / ReadBuffer_common (!). Without some micro-optimization or some form of vectorized [batched] I/O in recovery it's a dead end when it comes to small changes. Things that come to my mind for enhancing recovery:
- preadv() - works only for 1 fd, while the WAL stream might require reading a lot of random pages into s_b (many relations/fds; even btree inserts into a single relation might put data into many 1GB [default] segment files). This would only micro-optimize INSERT (pk) SELECT nextval(seq) kinds of processing on the recovery side, I suppose, and of course only provided that StartupXLOG worked in a more batched way: (a) reading a lot of blocks from WAL at once, (b) then issuing preadv() to get all the DB blocks into s_b for the same rel/fd, (c) applying the WAL. Sounds like a major refactor just to save syscalls :( (a rough sketch of such a batched read is below, after this list)
- mmap() - even more unrealistic
- IO_URING - gives a lot of promise here I think; is it even planned to be shown in the PgSQL 14 cycle, or is it more like PgSQL 15?
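To illustrate (b): a rough standalone sketch of reading a run of adjacent 8kB blocks from one fd into separate buffers with a single system call (toy code, not PostgreSQL internals; note that preadv() reads one contiguous byte range and only scatters it into multiple buffers, so it helps only when the wanted blocks are adjacent in the same segment file):

#define _GNU_SOURCE				/* for preadv() declaration on Linux */
#include <sys/types.h>
#include <sys/uio.h>

#define BLCKSZ 8192

/*
 * Toy sketch only: read "nblocks" consecutive 8kB blocks starting at
 * "startblock" from one file descriptor into separate buffers with a
 * single preadv() call.
 */
static ssize_t
read_block_run(int fd, unsigned int startblock, int nblocks, char **buffers)
{
	struct iovec iov[16];

	if (nblocks > 16)
		nblocks = 16;
	for (int i = 0; i < nblocks; i++)
	{
		iov[i].iov_base = buffers[i];
		iov[i].iov_len = BLCKSZ;
	}
	return preadv(fd, iov, nblocks, (off_t) startblock * BLCKSZ);
}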
-Jakub Wartak
[1]: /messages/by-id/VI1PR0701MB6960EEB838D53886D8A180E3F6520@VI1PR0701MB6960.eurprd07.prod.outlook.com
On Mon, Aug 31, 2020 at 8:50 PM Jakub Wartak <Jakub.Wartak@tomtom.com> wrote:
- IO_URING - gives a lot of promise here I think; is it even planned to be shown in the PgSQL 14 cycle, or is it more like PgSQL 15?
I can't answer that, but I've played around with the prototype quite a
bit, and thought quite a lot about how to port it to systems without
IO_URING, and I'm just as keen to see this happen as you are.
In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch. The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values. If
there are no comments or objections, I'm planning to commit this quite
soon.
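For anyone following along, the designated-initialiser version of the table looks roughly like this (handler names taken from the v2 patch; treat it as an illustration of the style rather than the exact v3 hunk):

static const SyncOps syncsw[] = {
	[SYNC_HANDLER_MD] = {
		.sync_syncfiletag = mdsyncfiletag,
		.sync_unlinkfiletag = mdunlinkfiletag,
		.sync_filetagmatches = mdfiletagmatches
	},
	[SYNC_HANDLER_CLOG] = {
		.sync_syncfiletag = clogsyncfiletag
	},
	[SYNC_HANDLER_COMMIT_TS] = {
		.sync_syncfiletag = committssyncfiletag
	},
	[SYNC_HANDLER_MULTIXACT_OFFSET] = {
		.sync_syncfiletag = multixactoffsetssyncfiletag
	},
	[SYNC_HANDLER_MULTIXACT_MEMBER] = {
		.sync_syncfiletag = multixactmemberssyncfiletag
	}
};

This way the compiler keeps the array entries lined up with the SyncRequestHandler enum for us.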
Attachments:
v3-0001-Defer-flushing-of-SLRU-files.patchtext/x-patch; charset=US-ASCII; name=v3-0001-Defer-flushing-of-SLRU-files.patchDownload
From 0eba29484684a0642c4424741e256b0787051a19 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v3] Defer flushing of SLRU files.
Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery. Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files. This causes a significant improvement to recovery
times.
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 30 +++++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 175 insertions(+), 47 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..21d5aa2dc0 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -80,6 +80,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -173,7 +185,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +264,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -870,23 +883,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1162,21 +1183,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1303,6 +1314,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1346,6 +1358,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1444,3 +1467,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a50f60b99a..16867c9d1e 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index cb341365df..8dbcace3f9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..577914a2b8 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..e5b65ca833 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,13 +93,32 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
- {
+ [SYNC_HANDLER_MD] = {
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
- }
+ },
+ /* pg_xact */
+ [SYNC_HANDLER_CLOG] = {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ [SYNC_HANDLER_COMMIT_TS] = {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ [SYNC_HANDLER_MULTIXACT_OFFSET] = {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ [SYNC_HANDLER_MULTIXACT_MEMBER] = {
+ .sync_syncfiletag = multixactmemberssyncfiletag
+ },
};
/*
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..2720284157 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,7 +136,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.20.1
On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch. The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values. If
there are no comments or objections, I'm planning to commit this quite
soon.
... and CI told me that Windows didn't like my array syntax with the
extra trailing comma. Here's one without.
Attachments:
v4-0001-Defer-flushing-of-SLRU-files.patch (text/x-patch; charset=US-ASCII)
From 89bc77827a73c93914f018fc862439f53ff54a39 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Sat, 19 Sep 2020 16:22:06 +1200
Subject: [PATCH v4 1/2] Defer flushing of SLRU files.
Previously, we called fsync() after writing out pg_xact, multixact and
commit_ts pages due to cache pressure, leading to an I/O stall in user
backends and recovery. Collapse requests for the same file into a
single system call as part of the next checkpoint, as we already do for
relation files. This causes a significant improvement to recovery
times.
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 13 +++-
src/backend/access/transam/commit_ts.c | 12 ++-
src/backend/access/transam/multixact.c | 24 +++++-
src/backend/access/transam/slru.c | 101 +++++++++++++++++++------
src/backend/access/transam/subtrans.c | 4 +-
src/backend/commands/async.c | 5 +-
src/backend/storage/lmgr/predicate.c | 4 +-
src/backend/storage/sync/sync.c | 28 ++++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 12 ++-
src/include/storage/sync.h | 7 +-
13 files changed, 174 insertions(+), 46 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..304612c159 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..913ec9e48d 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..344006b0f5 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -3386,3 +3388,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return slrusyncfiletag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..21d5aa2dc0 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -80,6 +80,18 @@ typedef struct SlruFlushData
typedef struct SlruFlushData *SlruFlush;
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
+
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
* of arguments!
@@ -173,7 +185,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +264,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -870,23 +883,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1162,21 +1183,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1303,6 +1314,7 @@ SlruDeleteSegment(SlruCtl ctl, int segno)
int slotno;
char path[MAXPGPATH];
bool did_write;
+ FileTag tag;
/* Clean out any possibly existing references to the segment. */
LWLockAcquire(shared->ControlLock, LW_EXCLUSIVE);
@@ -1346,6 +1358,17 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1444,3 +1467,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a50f60b99a..16867c9d1e 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index cb341365df..8dbcace3f9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..577914a2b8 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..1d635d596c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,12 +93,31 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
- {
+ [SYNC_HANDLER_MD] = {
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
+ },
+ /* pg_xact */
+ [SYNC_HANDLER_CLOG] = {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ [SYNC_HANDLER_COMMIT_TS] = {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ [SYNC_HANDLER_MULTIXACT_OFFSET] = {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ [SYNC_HANDLER_MULTIXACT_MEMBER] = {
+ .sync_syncfiletag = multixactmemberssyncfiletag
}
};
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..2720284157 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,7 +136,8 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int slrusyncfiletag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
--
2.20.1
On Sun, Sep 20, 2020 at 12:40 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Sat, Sep 19, 2020 at 5:06 PM Thomas Munro <thomas.munro@gmail.com> wrote:
In the meantime, from the low-hanging-fruit department, here's a new
version of the SLRU-fsync-offload patch. The only changes are a
tweaked commit message, and adoption of C99 designated initialisers
for the function table, so { [SYNC_HANDLER_CLOG] = ... } instead of
relying on humans to make the array order match the enum values. If
there are no comments or objections, I'm planning to commit this quite
soon.
... and CI told me that Windows didn't like my array syntax with the extra trailing comma. Here's one without.
While scanning for comments and identifier names that needed updating,
I realised that this patch changed the behaviour of the ShutdownXXX()
functions, since they currently flush the SLRUs but are not followed
by a checkpoint. I'm not entirely sure I understand the logic of
that, but it wasn't my intention to change it. So here's a version
that converts the existing fsync_fname() to fsync_fname_recurse() to
fix that.
Strangely, the fsync calls that ensure that directory entries are on
disk seemed to be missing from CheckPointMultiXact(), so I added them.
Isn't that a live bug?
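For context, the durability concern here is about the directory entry itself: fsync() on a newly created segment file persists its contents, but the name lives in the directory, which needs its own fsync() to survive a crash. A plain-POSIX sketch of roughly the effect that fsync_fname() has on a directory (the helper here is illustrative only, not fd.c code):

#include <fcntl.h>
#include <unistd.h>

static int
fsync_directory(const char *dirpath)
{
	int			fd = open(dirpath, O_RDONLY);
	int			rc;

	if (fd < 0)
		return -1;
	rc = fsync(fd);				/* flush the directory entries */
	close(fd);
	return rc;
}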
I decided it was a little too magical that CheckPointCLOG() etc
depended on the later call to CheckPointBuffers() to perform their
fsyncs. I started writing comments about that, but then I realised
that the right thing to do was to hoist ProcessSyncRequests() out of
there into CheckPointGuts() and make it all more explicit.
I also realised that it would be inconsistent to count SLRU sync
activity as buffer sync time, but not count SLRU write activity as
buffer write time, or count its buffers as written in the reported
statistics. In other words, SLRU buffers *are* buffers for checkpoint
reporting purposes (or should at least be consistently in or out of
the stats, and with this patch they have to be in).
Does that make sense? Is there a problem I'm not seeing with
reordering CheckPointGuts() as I have?
One comment change that seems worth highlighting is this code reached
by VACUUM, which seems like a strict improvement (it wasn't flushing
for crash recovery):
/*
- * Flush out dirty data, so PhysicalPageExists can work correctly.
- * SimpleLruFlush() is a pretty big hammer for that. Alternatively we
- * could add an in-memory version of page exists, but find_multixact_start
- * is called infrequently, and it doesn't seem bad to flush buffers to
- * disk before truncation.
+ * Write out dirty data, so PhysicalPageExists can work correctly.
*/
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
Attachments:
v5-0001-Defer-flushing-of-SLRU-files.patch (text/x-patch; charset=US-ASCII)
From c0b2281a983144d39ff7288431ef9c4af51b2e32 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v5] Defer flushing of SLRU files.
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.
Perform a new recursive fsync() in the ShutdownXXX() functions to
preserve existing behavior, since they won't be followed by a
checkpoint.
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 29 +++--
src/backend/access/transam/commit_ts.c | 28 +++--
src/backend/access/transam/multixact.c | 62 +++++++---
src/backend/access/transam/slru.c | 150 +++++++++++++++++--------
src/backend/access/transam/subtrans.c | 12 +-
src/backend/access/transam/xlog.c | 14 ++-
src/backend/commands/async.c | 5 +-
src/backend/storage/buffer/bufmgr.c | 7 --
src/backend/storage/file/fd.c | 20 ++++
src/backend/storage/lmgr/predicate.c | 8 +-
src/backend/storage/sync/sync.c | 28 ++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 14 ++-
src/include/storage/fd.h | 1 +
src/include/storage/sync.h | 7 +-
src/tools/pgindent/typedefs.list | 1 +
18 files changed, 284 insertions(+), 112 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..c3984c5bfc 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -814,15 +816,15 @@ TrimCLOG(void)
void
ShutdownCLOG(void)
{
- /* Flush dirty CLOG pages to disk */
+ /* Write dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
- SimpleLruFlush(XactCtl, false);
+ SimpleLruWriteAll(XactCtl, false);
/*
- * fsync pg_xact to ensure that any files flushed previously are durably
+ * fsync pg_xact to ensure that any files written previously are durably
* on disk.
*/
- fsync_fname("pg_xact", true);
+ fsync_fname_recurse("pg_xact", true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
}
@@ -833,13 +835,13 @@ ShutdownCLOG(void)
void
CheckPointCLOG(void)
{
- /* Flush dirty CLOG pages to disk */
+ /* Write dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
- SimpleLruFlush(XactCtl, true);
+ SimpleLruWriteAll(XactCtl, true);
/*
- * fsync pg_xact to ensure that any files flushed previously are durably
- * on disk.
+ * fsync pg_xact to ensure that directory entries are durably on disk.
+ * The contents of files will be handled by ProcessSyncRequests().
*/
fsync_fname("pg_xact", true);
@@ -1033,3 +1035,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..9ec67fb30c 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -804,14 +805,14 @@ DeactivateCommitTs(void)
void
ShutdownCommitTs(void)
{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, false);
+ /* Write dirty CommitTs pages to disk */
+ SimpleLruWriteAll(CommitTsCtl, false);
/*
- * fsync pg_commit_ts to ensure that any files flushed previously are
+ * fsync pg_commit_ts to ensure that any files written previously are
* durably on disk.
*/
- fsync_fname("pg_commit_ts", true);
+ fsync_fname_recurse("pg_commit_ts", true);
}
/*
@@ -820,12 +821,12 @@ ShutdownCommitTs(void)
void
CheckPointCommitTs(void)
{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, true);
+ /* Write dirty CommitTs pages to disk */
+ SimpleLruWriteAll(CommitTsCtl, true);
/*
- * fsync pg_commit_ts to ensure that any files flushed previously are
- * durably on disk.
+ * fsync pg_commit_ts to ensure that directory entries are durably on disk.
+ * The contents of files will be handled by ProcessSyncRequests().
*/
fsync_fname("pg_commit_ts", true);
}
@@ -1083,3 +1084,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..8d65c46df9 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -2106,10 +2108,19 @@ TrimMultiXact(void)
void
ShutdownMultiXact(void)
{
- /* Flush dirty MultiXact pages to disk */
+ /* Write dirty MultiXact pages to disk */
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(false);
- SimpleLruFlush(MultiXactOffsetCtl, false);
- SimpleLruFlush(MultiXactMemberCtl, false);
+
+ SimpleLruWriteAll(MultiXactOffsetCtl, false);
+ SimpleLruWriteAll(MultiXactMemberCtl, false);
+
+ /*
+ * fsync pg_multixact to ensure that any files written previously are
+ * durably on disk.
+ */
+ fsync_fname_recurse("pg_multixact/offsets", true);
+ fsync_fname_recurse("pg_multixact/members", true);
+
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(false);
}
@@ -2143,9 +2154,16 @@ CheckPointMultiXact(void)
{
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
- /* Flush dirty MultiXact pages to disk */
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ /* Write dirty MultiXact pages to disk */
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
+
+ /*
+ * fsync to ensure that directory entries are on disk. File contents will
+ * be handled by ProcessSyncRequests().
+ */
+ fsync_fname("pg_multixact/offsets", true);
+ fsync_fname("pg_multixact/members", true);
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
@@ -2728,14 +2746,10 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
entryno = MultiXactIdToOffsetEntry(multi);
/*
- * Flush out dirty data, so PhysicalPageExists can work correctly.
- * SimpleLruFlush() is a pretty big hammer for that. Alternatively we
- * could add an in-memory version of page exists, but find_multixact_start
- * is called infrequently, and it doesn't seem bad to flush buffers to
- * disk before truncation.
+ * Write out dirty data, so PhysicalPageExists can work correctly.
*/
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
return false;
@@ -3386,3 +3400,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..52314d0ee0 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -63,22 +63,33 @@
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
/*
- * During SimpleLruFlush(), we will usually not need to write/fsync more
- * than one or two physical files, but we may need to write several pages
- * per file. We can consolidate the I/O requests by leaving files open
- * until control returns to SimpleLruFlush(). This data structure remembers
- * which files are open.
+ * During SimpleLruWriteAll(), we will usually not need to write more than one
+ * or two physical files, but we may need to write several pages per file. We
+ * can consolidate the I/O requests by leaving files open until control returns
+ * to SimpleLruWriteAll(). This data structure remembers which files are open.
*/
-#define MAX_FLUSH_BUFFERS 16
+#define MAX_WRITEALL_BUFFERS 16
-typedef struct SlruFlushData
+typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
- int fd[MAX_FLUSH_BUFFERS]; /* their FD's */
- int segno[MAX_FLUSH_BUFFERS]; /* their log seg#s */
-} SlruFlushData;
+ int fd[MAX_WRITEALL_BUFFERS]; /* their FD's */
+ int segno[MAX_WRITEALL_BUFFERS]; /* their log seg#s */
+} SlruWriteAllData;
-typedef struct SlruFlushData *SlruFlush;
+typedef struct SlruWriteAllData *SlruWriteAll;
+
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
@@ -125,10 +136,10 @@ static int slru_errno;
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
-static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata);
+static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
static bool SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno);
static bool SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
- SlruFlush fdata);
+ SlruWriteAll fdata);
static void SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid);
static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
@@ -173,7 +184,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +263,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -523,7 +535,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
* Control lock must be held at entry, and will be held at exit.
*/
static void
-SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
+SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
@@ -587,6 +599,10 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
/* Now it's okay to ereport if we failed */
if (!ok)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+ /* If part of a checkpoint, count this as a buffer written. */
+ if (fdata)
+ CheckpointStats.ckpt_bufs_written++;
}
/*
@@ -730,13 +746,13 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
- * SimpleLruFlush, though.
+ * SimpleLruWriteAll, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
- * SimpleLruFlush.
+ * SimpleLruWriteAll.
*/
static bool
-SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
+SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int segno = pageno / SLRU_PAGES_PER_SEGMENT;
@@ -791,7 +807,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
/*
- * During a Flush, we may already have the desired file open.
+ * During a WriteAll, we may already have the desired file open.
*/
if (fdata)
{
@@ -837,7 +853,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
if (fdata)
{
- if (fdata->num_files < MAX_FLUSH_BUFFERS)
+ if (fdata->num_files < MAX_WRITEALL_BUFFERS)
{
fdata->fd[fdata->num_files] = fd;
fdata->segno[fdata->num_files] = segno;
@@ -870,23 +886,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1122,13 +1146,14 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
}
/*
- * Flush dirty pages to disk during checkpoint or database shutdown
+ * Write dirty pages to disk during checkpoint or database shutdown. Flushing
+ * is deferred until the next call to ProcessSyncRequests().
*/
void
-SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
+SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
{
SlruShared shared = ctl->shared;
- SlruFlushData fdata;
+ SlruWriteAllData fdata;
int slotno;
int pageno = 0;
int i;
@@ -1162,21 +1187,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1346,6 +1361,19 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1444,3 +1472,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a50f60b99a..d9612c1f03 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
@@ -285,13 +283,13 @@ void
ShutdownSUBTRANS(void)
{
/*
- * Flush dirty SUBTRANS pages to disk
+ * Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(false);
- SimpleLruFlush(SubTransCtl, false);
+ SimpleLruWriteAll(SubTransCtl, false);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(false);
}
@@ -302,14 +300,14 @@ void
CheckPointSUBTRANS(void)
{
/*
- * Flush dirty SUBTRANS pages to disk
+ * Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely to improve the odds that writing of dirty pages is done by
* the checkpoint process and not by backends.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
- SimpleLruFlush(SubTransCtl, true);
+ SimpleLruWriteAll(SubTransCtl, true);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..f2c03ee303 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9176,16 +9176,28 @@ CreateEndOfRecoveryRecord(void)
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
+ /* Write out all dirty data in SLRUs and the main buffer pool */
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
+ CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
+ CheckPointBuffers(flags);
+
+ /* Perform all required fsyncs */
+ CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
+ ProcessSyncRequests();
+ CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
+
+ /* Other activities performed at checkpoint time */
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
- CheckPointBuffers(flags); /* performs all required fsyncs */
CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index cb341365df..8dbcace3f9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..e549fa1d30 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2636,14 +2636,7 @@ PrintBufferLeakWarning(Buffer buffer)
void
CheckPointBuffers(int flags)
{
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
- CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
BufferSync(flags);
- CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- ProcessSyncRequests();
- CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index bd72a87ee3..93d170c118 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -3530,6 +3530,26 @@ fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel)
return 0;
}
+/*
+ * Helper for fsync_fname_recurse().
+ */
+static void
+do_fsync_fname_recurse(const char *fname, bool isdir, int elevel)
+{
+ fsync_fname_ext(fname, isdir, false, elevel);
+}
+
+/*
+ * Like fsync_fname(), but also fsync the contents of contained files.
+ */
+void
+fsync_fname_recurse(const char *fname, bool isdir)
+{
+ if (isdir)
+ walkdir(fname, do_fsync_fname_recurse, false, data_sync_elevel(ERROR));
+ fsync_fname_ext(fname, isdir, false, data_sync_elevel(ERROR));
+}
+
/*
* fsync_parent_path -- fsync the parent path of a file or directory
*
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..8a365b400c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
@@ -1052,7 +1050,7 @@ CheckPointPredicate(void)
SimpleLruTruncate(SerialSlruCtl, tailPage);
/*
- * Flush dirty SLRU pages to disk
+ * Write dirty SLRU pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
@@ -1061,7 +1059,7 @@ CheckPointPredicate(void)
* before deleting the file in which they sit, which would be completely
* pointless.
*/
- SimpleLruFlush(SerialSlruCtl, true);
+ SimpleLruWriteAll(SerialSlruCtl, true);
}
/*------------------------------------------------------------------------*/
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..1d635d596c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,12 +93,31 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
- {
+ [SYNC_HANDLER_MD] = {
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
+ },
+ /* pg_xact */
+ [SYNC_HANDLER_CLOG] = {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ [SYNC_HANDLER_COMMIT_TS] = {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ [SYNC_HANDLER_MULTIXACT_OFFSET] = {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ [SYNC_HANDLER_MULTIXACT_MEMBER] = {
+ .sync_syncfiletag = multixactmemberssyncfiletag
}
};
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..b39b43504d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,14 +136,15 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
TransactionId xid);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
+extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index e209f047e8..ea5c3010d2 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -155,6 +155,7 @@ extern int pg_fdatasync(int fd);
extern void pg_flush_data(int fd, off_t offset, off_t amount);
extern void fsync_fname(const char *fname, bool isdir);
extern int fsync_fname_ext(const char *fname, bool isdir, bool ignore_perm, int elevel);
+extern void fsync_fname_recurse(const char *fname, bool isdir);
extern int durable_rename(const char *oldfile, const char *newfile, int loglevel);
extern int durable_unlink(const char *fname, int loglevel);
extern int durable_rename_excl(const char *oldfile, const char *newfile, int loglevel);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1afb345c3..88fc3ceae4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2404,6 +2404,7 @@ Syn
SyncOps
SyncRepConfigData
SyncRepStandbyData
+SyncRequestHandler
SyncRequestType
SysScanDesc
SyscacheCallbackFunction
--
2.20.1
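For readers skimming the sync.c hunk above: the new handler table relies
on C99 designated initializers indexed by the SyncRequestHandler enum, so
adding an SLRU only takes an enum member plus a table entry. A tiny
standalone illustration of that pattern (not PostgreSQL code; the names
below are made up):

  #include <stdio.h>

  typedef enum { HANDLER_MD, HANDLER_CLOG } Handler;

  typedef struct
  {
      int        (*sync_syncfiletag) (int segno);
  } Ops;

  static int md_sync(int segno)   { printf("sync md seg %d\n", segno); return 0; }
  static int clog_sync(int segno) { printf("sync clog seg %d\n", segno); return 0; }

  /* The array index must match the enum value carried by each request. */
  static const Ops dispatch[] = {
      [HANDLER_MD] = {.sync_syncfiletag = md_sync},
      [HANDLER_CLOG] = {.sync_syncfiletag = clog_sync},
  };

  int
  main(void)
  {
      /* A queued request remembers only (handler, segno). */
      Handler    h = HANDLER_CLOG;

      return dispatch[h].sync_syncfiletag(7);
  }

Note that SYNC_HANDLER_NONE is not an index into the real table; it just
tells slru.c to skip queuing requests for SLRUs such as pg_subtrans and
pg_notify whose contents don't need to survive a crash.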
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
While scanning for comments and identifier names that needed updating,
I realised that this patch changed the behaviour of the ShutdownXXX()
functions, since they currently flush the SLRUs but are not followed
by a checkpoint. I'm not entirely sure I understand the logic of
that, but it wasn't my intention to change it. So here's a version
that converts the existing fsync_fname() to fsync_fname_recurse() to
Bleugh, that was probably a bad idea, it's too expensive. But it
forces me to ask the question: *why* do we need to call
Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
shutdown checkpoint? I wondered if this might date from before the
WAL, but I see that the pattern was introduced when the CLOG was moved
out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
but even in that commit the preceding CreateCheckPoint() call included
a call to CheckPointCLOG().
On Tue, Sep 22, 2020 at 9:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:
On Mon, Sep 21, 2020 at 2:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:
While scanning for comments and identifier names that needed updating,
I realised that this patch changed the behaviour of the ShutdownXXX()
functions, since they currently flush the SLRUs but are not followed
by a checkpoint. I'm not entirely sure I understand the logic of
that, but it wasn't my intention to change it. So here's a version
that converts the existing fsync_fname() to fsync_fname_recurse() to
Bleugh, that was probably a bad idea, it's too expensive. But it
forces me to ask the question: *why* do we need to call
Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() after creating a
shutdown checkpoint? I wondered if this might date from before the
WAL, but I see that the pattern was introduced when the CLOG was moved
out of shared buffers into a proto-SLRU in ancient commit 2589735da08,
but even in that commit the preceding CreateCheckPoint() call included
a call to CheckPointCLOG().
I complained about the apparently missing multixact fsync in a new
thread, because if I'm right about that it requires a back-patchable
fix.
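(To spell out the failure mode: fsyncing a newly created segment file
makes its contents durable, but on POSIX filesystems the file's directory
entry is only guaranteed to survive a crash once the containing directory
has been fsync'd as well. A rough sketch of that invariant in plain POSIX,
outside of any PostgreSQL infrastructure:

  #include <fcntl.h>
  #include <unistd.h>

  static void
  make_new_file_durable(const char *filepath, const char *dirpath)
  {
      int    fd;

      fd = open(filepath, O_WRONLY);
      if (fd >= 0)
      {
          fsync(fd);       /* flush the file's own data */
          close(fd);
      }

      fd = open(dirpath, O_RDONLY);
      if (fd >= 0)
      {
          fsync(fd);       /* flush the directory entry naming the file */
          close(fd);
      }
  }

The first patch below does the equivalent by calling
fsync_fname(ctl->Dir, true) at the end of SimpleLruFlush(), so every SLRU
that fsyncs at all gets the directory fsync, instead of only pg_xact and
pg_commit_ts.)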
As for the ShutdownXXX() functions, I haven't yet come up with any
reason for this code to exist. Emboldened by a colleague's inability
to explain to me what that code is doing for us, here is a new version
that just rips it all out.
Attachments:
v6-0001-Fix-missing-fsync-of-multixact-directories.patch (text/x-patch; charset=US-ASCII)
From 89e5b787ea5efd03ced8da46f95f236e03bf4adf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Wed, 23 Sep 2020 13:02:27 +1200
Subject: [PATCH v6 1/2] Fix missing fsync of multixact directories.
Standardize behavior by moving responsibility for fsyncing directories
down into slru.c.
Back-patch to all supported releases.
---
src/backend/access/transam/clog.c | 7 -------
src/backend/access/transam/commit_ts.c | 6 ------
src/backend/access/transam/slru.c | 7 +++++++
3 files changed, 7 insertions(+), 13 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 65aa8841f7..9e352d2658 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -836,13 +836,6 @@ CheckPointCLOG(void)
/* Flush dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
SimpleLruFlush(XactCtl, true);
-
- /*
- * fsync pg_xact to ensure that any files flushed previously are durably
- * on disk.
- */
- fsync_fname("pg_xact", true);
-
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index 5244b06a2b..f6a7329ba3 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -822,12 +822,6 @@ CheckPointCommitTs(void)
{
/* Flush dirty CommitTs pages to disk */
SimpleLruFlush(CommitTsCtl, true);
-
- /*
- * fsync pg_commit_ts to ensure that any files flushed previously are
- * durably on disk.
- */
- fsync_fname("pg_commit_ts", true);
}
/*
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 7640f153c2..89ce7c43b5 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -1187,6 +1187,13 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
}
if (!ok)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+ /*
+ * Make sure that the directory entries for any newly created files are on
+ * disk.
+ */
+ if (ctl->do_fsync)
+ fsync_fname(ctl->Dir, true);
}
/*
--
2.20.1
v6-0002-Defer-flushing-of-SLRU-files.patch (text/x-patch; charset=US-ASCII)
From ae51604c43e10ba6d24b628c2b907eca251177e9 Mon Sep 17 00:00:00 2001
From: Thomas Munro <tmunro@postgresql.org>
Date: Mon, 21 Sep 2020 09:43:32 +1200
Subject: [PATCH v6 2/2] Defer flushing of SLRU files.
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.
Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 36 +++---
src/backend/access/transam/commit_ts.c | 32 +++--
src/backend/access/transam/multixact.c | 53 +++++----
src/backend/access/transam/slru.c | 154 +++++++++++++++++--------
src/backend/access/transam/subtrans.c | 25 +---
src/backend/access/transam/xlog.c | 18 ++-
src/backend/commands/async.c | 5 +-
src/backend/storage/buffer/bufmgr.c | 7 --
src/backend/storage/lmgr/predicate.c | 8 +-
src/backend/storage/sync/sync.c | 28 ++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 14 ++-
src/include/storage/sync.h | 7 +-
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 233 insertions(+), 165 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..c4c20f508b 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -808,34 +810,15 @@ TrimCLOG(void)
LWLockRelease(XactSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
- /* Flush dirty CLOG pages to disk */
- TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
- SimpleLruFlush(XactCtl, false);
-
- /*
- * fsync pg_xact to ensure that any files flushed previously are durably
- * on disk.
- */
- fsync_fname("pg_xact", true);
-
- TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCLOG(void)
{
- /* Flush dirty CLOG pages to disk */
+ /* Write dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
- SimpleLruFlush(XactCtl, true);
+ SimpleLruWriteAll(XactCtl, true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
@@ -1026,3 +1009,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index f6a7329ba3..af5d9baae5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -798,30 +799,14 @@ DeactivateCommitTs(void)
LWLockRelease(CommitTsSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCommitTs(void)
-{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, false);
-
- /*
- * fsync pg_commit_ts to ensure that any files flushed previously are
- * durably on disk.
- */
- fsync_fname("pg_commit_ts", true);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCommitTs(void)
{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, true);
+ /* Write dirty CommitTs pages to disk */
+ SimpleLruWriteAll(CommitTsCtl, true);
}
/*
@@ -1077,3 +1062,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..5501da088b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -2100,19 +2102,6 @@ TrimMultiXact(void)
SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownMultiXact(void)
-{
- /* Flush dirty MultiXact pages to disk */
- TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(false);
- SimpleLruFlush(MultiXactOffsetCtl, false);
- SimpleLruFlush(MultiXactMemberCtl, false);
- TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(false);
-}
-
/*
* Get the MultiXact data to save in a checkpoint record
*/
@@ -2143,9 +2132,9 @@ CheckPointMultiXact(void)
{
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
- /* Flush dirty MultiXact pages to disk */
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ /* Write dirty MultiXact pages to disk */
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
@@ -2728,14 +2717,10 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
entryno = MultiXactIdToOffsetEntry(multi);
/*
- * Flush out dirty data, so PhysicalPageExists can work correctly.
- * SimpleLruFlush() is a pretty big hammer for that. Alternatively we
- * could add an in-memory version of page exists, but find_multixact_start
- * is called infrequently, and it doesn't seem bad to flush buffers to
- * disk before truncation.
+ * Write out dirty data, so PhysicalPageExists can work correctly.
*/
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
return false;
@@ -3386,3 +3371,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index 89ce7c43b5..390d36d03f 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -63,22 +63,33 @@
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
/*
- * During SimpleLruFlush(), we will usually not need to write/fsync more
- * than one or two physical files, but we may need to write several pages
- * per file. We can consolidate the I/O requests by leaving files open
- * until control returns to SimpleLruFlush(). This data structure remembers
- * which files are open.
+ * During SimpleLruWriteAll(), we will usually not need to write more than one
+ * or two physical files, but we may need to write several pages per file. We
+ * can consolidate the I/O requests by leaving files open until control returns
+ * to SimpleLruWriteAll(). This data structure remembers which files are open.
*/
-#define MAX_FLUSH_BUFFERS 16
+#define MAX_WRITEALL_BUFFERS 16
-typedef struct SlruFlushData
+typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
- int fd[MAX_FLUSH_BUFFERS]; /* their FD's */
- int segno[MAX_FLUSH_BUFFERS]; /* their log seg#s */
-} SlruFlushData;
+ int fd[MAX_WRITEALL_BUFFERS]; /* their FD's */
+ int segno[MAX_WRITEALL_BUFFERS]; /* their log seg#s */
+} SlruWriteAllData;
-typedef struct SlruFlushData *SlruFlush;
+typedef struct SlruWriteAllData *SlruWriteAll;
+
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
@@ -125,10 +136,10 @@ static int slru_errno;
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
-static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata);
+static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
static bool SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno);
static bool SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
- SlruFlush fdata);
+ SlruWriteAll fdata);
static void SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid);
static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
@@ -173,7 +184,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +263,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -523,7 +535,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
* Control lock must be held at entry, and will be held at exit.
*/
static void
-SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
+SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
@@ -587,6 +599,10 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
/* Now it's okay to ereport if we failed */
if (!ok)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+ /* If part of a checkpoint, count this as a buffer written. */
+ if (fdata)
+ CheckpointStats.ckpt_bufs_written++;
}
/*
@@ -730,13 +746,13 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
- * SimpleLruFlush, though.
+ * SimpleLruWriteAll, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
- * SimpleLruFlush.
+ * SimpleLruWriteAll.
*/
static bool
-SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
+SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int segno = pageno / SLRU_PAGES_PER_SEGMENT;
@@ -791,7 +807,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
/*
- * During a Flush, we may already have the desired file open.
+ * During a WriteAll, we may already have the desired file open.
*/
if (fdata)
{
@@ -837,7 +853,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
if (fdata)
{
- if (fdata->num_files < MAX_FLUSH_BUFFERS)
+ if (fdata->num_files < MAX_WRITEALL_BUFFERS)
{
fdata->fd[fdata->num_files] = fd;
fdata->segno[fdata->num_files] = segno;
@@ -870,23 +886,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1122,13 +1146,16 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
}
/*
- * Flush dirty pages to disk during checkpoint or database shutdown
+ * Write dirty pages to disk during checkpoint or database shutdown. Flushing
+ * is deferred until the next call to ProcessSyncRequests(), though we do fsync
+ * the containing directory here to make sure that newly created directory
+ * entries are on disk.
*/
void
-SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
+SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
{
SlruShared shared = ctl->shared;
- SlruFlushData fdata;
+ SlruWriteAllData fdata;
int slotno;
int pageno = 0;
int i;
@@ -1162,21 +1189,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1192,7 +1209,7 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
* Make sure that the directory entries for any newly created files are on
* disk.
*/
- if (ctl->do_fsync)
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
fsync_fname(ctl->Dir, true);
}
@@ -1353,6 +1370,19 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1451,3 +1481,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a50f60b99a..0111e867c7 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
@@ -278,23 +276,6 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
LWLockRelease(SubtransSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownSUBTRANS(void)
-{
- /*
- * Flush dirty SUBTRANS pages to disk
- *
- * This is not actually necessary from a correctness point of view. We do
- * it merely as a debugging aid.
- */
- TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(false);
- SimpleLruFlush(SubTransCtl, false);
- TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(false);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
@@ -302,14 +283,14 @@ void
CheckPointSUBTRANS(void)
{
/*
- * Flush dirty SUBTRANS pages to disk
+ * Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely to improve the odds that writing of dirty pages is done by
* the checkpoint process and not by backends.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
- SimpleLruFlush(SubTransCtl, true);
+ SimpleLruWriteAll(SubTransCtl, true);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..ed17dd0d12 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8528,10 +8528,6 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();
}
/*
@@ -9176,16 +9172,28 @@ CreateEndOfRecoveryRecord(void)
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
+ /* Write out all dirty data in SLRUs and the main buffer pool */
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
+ CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
+ CheckPointBuffers(flags);
+
+ /* Perform all required fsyncs */
+ CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
+ ProcessSyncRequests();
+ CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
+
+ /* Other activities performed at checkpoint time */
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
- CheckPointBuffers(flags); /* performs all required fsyncs */
CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index cb341365df..8dbcace3f9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..e549fa1d30 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2636,14 +2636,7 @@ PrintBufferLeakWarning(Buffer buffer)
void
CheckPointBuffers(int flags)
{
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
- CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
BufferSync(flags);
- CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- ProcessSyncRequests();
- CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..8a365b400c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
@@ -1052,7 +1050,7 @@ CheckPointPredicate(void)
SimpleLruTruncate(SerialSlruCtl, tailPage);
/*
- * Flush dirty SLRU pages to disk
+ * Write dirty SLRU pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
@@ -1061,7 +1059,7 @@ CheckPointPredicate(void)
* before deleting the file in which they sit, which would be completely
* pointless.
*/
- SimpleLruFlush(SerialSlruCtl, true);
+ SimpleLruWriteAll(SerialSlruCtl, true);
}
/*------------------------------------------------------------------------*/
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..1d635d596c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,12 +93,31 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
- {
+ [SYNC_HANDLER_MD] = {
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
+ },
+ /* pg_xact */
+ [SYNC_HANDLER_CLOG] = {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ [SYNC_HANDLER_COMMIT_TS] = {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ [SYNC_HANDLER_MULTIXACT_OFFSET] = {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ [SYNC_HANDLER_MULTIXACT_MEMBER] = {
+ .sync_syncfiletag = multixactmemberssyncfiletag
}
};
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..b39b43504d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,14 +136,15 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
TransactionId xid);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
+extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1afb345c3..88fc3ceae4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2404,6 +2404,7 @@ Syn
SyncOps
SyncRepConfigData
SyncRepStandbyData
+SyncRequestHandler
SyncRequestType
SysScanDesc
SyscacheCallbackFunction
--
2.20.1
On Wed, Sep 23, 2020 at 1:56 PM Thomas Munro <thomas.munro@gmail.com> wrote:
As for the ShutdownXXX() functions, I haven't yet come up with any
reason for this code to exist. Emboldened by a colleague's inability
to explain to me what that code is doing for us, here is a new version
that just rips it all out.
Rebased.
Tom, do you have any thoughts on ShutdownCLOG() etc?
Attachments:
v7-0001-Defer-flushing-of-SLRU-files.patch (text/x-patch; charset=US-ASCII)
From 07196368e6cd36df5910b536e93570c9acff4ff2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v7] Defer flushing of SLRU files.
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.
Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 36 +++---
src/backend/access/transam/commit_ts.c | 32 +++--
src/backend/access/transam/multixact.c | 53 +++++----
src/backend/access/transam/slru.c | 154 +++++++++++++++++--------
src/backend/access/transam/subtrans.c | 25 +---
src/backend/access/transam/xlog.c | 18 ++-
src/backend/commands/async.c | 5 +-
src/backend/storage/buffer/bufmgr.c | 7 --
src/backend/storage/lmgr/predicate.c | 8 +-
src/backend/storage/sync/sync.c | 28 ++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 14 ++-
src/include/storage/sync.h | 7 +-
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 233 insertions(+), 165 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..c4c20f508b 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -808,34 +810,15 @@ TrimCLOG(void)
LWLockRelease(XactSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
- /* Flush dirty CLOG pages to disk */
- TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
- SimpleLruFlush(XactCtl, false);
-
- /*
- * fsync pg_xact to ensure that any files flushed previously are durably
- * on disk.
- */
- fsync_fname("pg_xact", true);
-
- TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCLOG(void)
{
- /* Flush dirty CLOG pages to disk */
+ /* Write dirty CLOG pages to disk */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
- SimpleLruFlush(XactCtl, true);
+ SimpleLruWriteAll(XactCtl, true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
@@ -1026,3 +1009,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index f6a7329ba3..af5d9baae5 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -798,30 +799,14 @@ DeactivateCommitTs(void)
LWLockRelease(CommitTsSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCommitTs(void)
-{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, false);
-
- /*
- * fsync pg_commit_ts to ensure that any files flushed previously are
- * durably on disk.
- */
- fsync_fname("pg_commit_ts", true);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCommitTs(void)
{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, true);
+ /* Write dirty CommitTs pages to disk */
+ SimpleLruWriteAll(CommitTsCtl, true);
}
/*
@@ -1077,3 +1062,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..5501da088b 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -2100,19 +2102,6 @@ TrimMultiXact(void)
SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownMultiXact(void)
-{
- /* Flush dirty MultiXact pages to disk */
- TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(false);
- SimpleLruFlush(MultiXactOffsetCtl, false);
- SimpleLruFlush(MultiXactMemberCtl, false);
- TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(false);
-}
-
/*
* Get the MultiXact data to save in a checkpoint record
*/
@@ -2143,9 +2132,9 @@ CheckPointMultiXact(void)
{
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
- /* Flush dirty MultiXact pages to disk */
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ /* Write dirty MultiXact pages to disk */
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
@@ -2728,14 +2717,10 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
entryno = MultiXactIdToOffsetEntry(multi);
/*
- * Flush out dirty data, so PhysicalPageExists can work correctly.
- * SimpleLruFlush() is a pretty big hammer for that. Alternatively we
- * could add an in-memory version of page exists, but find_multixact_start
- * is called infrequently, and it doesn't seem bad to flush buffers to
- * disk before truncation.
+ * Write out dirty data, so PhysicalPageExists can work correctly.
*/
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
return false;
@@ -3386,3 +3371,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index fe7d759a8c..636b90db87 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -63,22 +63,33 @@
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
/*
- * During SimpleLruFlush(), we will usually not need to write/fsync more
- * than one or two physical files, but we may need to write several pages
- * per file. We can consolidate the I/O requests by leaving files open
- * until control returns to SimpleLruFlush(). This data structure remembers
- * which files are open.
+ * During SimpleLruWriteAll(), we will usually not need to write more than one
+ * or two physical files, but we may need to write several pages per file. We
+ * can consolidate the I/O requests by leaving files open until control returns
+ * to SimpleLruWriteAll(). This data structure remembers which files are open.
*/
-#define MAX_FLUSH_BUFFERS 16
+#define MAX_WRITEALL_BUFFERS 16
-typedef struct SlruFlushData
+typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
- int fd[MAX_FLUSH_BUFFERS]; /* their FD's */
- int segno[MAX_FLUSH_BUFFERS]; /* their log seg#s */
-} SlruFlushData;
+ int fd[MAX_WRITEALL_BUFFERS]; /* their FD's */
+ int segno[MAX_WRITEALL_BUFFERS]; /* their log seg#s */
+} SlruWriteAllData;
-typedef struct SlruFlushData *SlruFlush;
+typedef struct SlruWriteAllData *SlruWriteAll;
+
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
@@ -125,10 +136,10 @@ static int slru_errno;
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
-static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata);
+static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
static bool SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno);
static bool SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
- SlruFlush fdata);
+ SlruWriteAll fdata);
static void SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid);
static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
@@ -173,7 +184,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +263,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -523,7 +535,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
* Control lock must be held at entry, and will be held at exit.
*/
static void
-SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
+SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
@@ -587,6 +599,10 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
/* Now it's okay to ereport if we failed */
if (!ok)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+ /* If part of a checkpoint, count this as a buffer written. */
+ if (fdata)
+ CheckpointStats.ckpt_bufs_written++;
}
/*
@@ -730,13 +746,13 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
- * SimpleLruFlush, though.
+ * SimpleLruWriteAll, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
- * SimpleLruFlush.
+ * SimpleLruWriteAll.
*/
static bool
-SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
+SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int segno = pageno / SLRU_PAGES_PER_SEGMENT;
@@ -791,7 +807,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
/*
- * During a Flush, we may already have the desired file open.
+ * During a WriteAll, we may already have the desired file open.
*/
if (fdata)
{
@@ -837,7 +853,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
if (fdata)
{
- if (fdata->num_files < MAX_FLUSH_BUFFERS)
+ if (fdata->num_files < MAX_WRITEALL_BUFFERS)
{
fdata->fd[fdata->num_files] = fd;
fdata->segno[fdata->num_files] = segno;
@@ -870,23 +886,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1122,13 +1146,16 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
}
/*
- * Flush dirty pages to disk during checkpoint or database shutdown
+ * Write dirty pages to disk during checkpoint or database shutdown. Flushing
+ * is deferred until the next call to ProcessSyncRequests(), though we do fsync
+ * the containing directory here to make sure that newly created directory
+ * entries are on disk.
*/
void
-SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
+SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
{
SlruShared shared = ctl->shared;
- SlruFlushData fdata;
+ SlruWriteAllData fdata;
int slotno;
int pageno = 0;
int i;
@@ -1162,21 +1189,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1189,7 +1206,7 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
/* Ensure that directory entries for new files are on disk. */
- if (ctl->do_fsync)
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
fsync_fname(ctl->Dir, true);
}
@@ -1350,6 +1367,19 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1448,3 +1478,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a50f60b99a..0111e867c7 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
@@ -278,23 +276,6 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
LWLockRelease(SubtransSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownSUBTRANS(void)
-{
- /*
- * Flush dirty SUBTRANS pages to disk
- *
- * This is not actually necessary from a correctness point of view. We do
- * it merely as a debugging aid.
- */
- TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(false);
- SimpleLruFlush(SubTransCtl, false);
- TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(false);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
@@ -302,14 +283,14 @@ void
CheckPointSUBTRANS(void)
{
/*
- * Flush dirty SUBTRANS pages to disk
+ * Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely to improve the odds that writing of dirty pages is done by
* the checkpoint process and not by backends.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
- SimpleLruFlush(SubTransCtl, true);
+ SimpleLruWriteAll(SubTransCtl, true);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..ed17dd0d12 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8528,10 +8528,6 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();
}
/*
@@ -9176,16 +9172,28 @@ CreateEndOfRecoveryRecord(void)
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
+ /* Write out all dirty data in SLRUs and the main buffer pool */
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
+ CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
+ CheckPointBuffers(flags);
+
+ /* Perform all required fsyncs */
+ CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
+ ProcessSyncRequests();
+ CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
+
+ /* Other activities performed at checkpoint time */
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
- CheckPointBuffers(flags); /* performs all required fsyncs */
CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index cb341365df..8dbcace3f9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..e549fa1d30 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2636,14 +2636,7 @@ PrintBufferLeakWarning(Buffer buffer)
void
CheckPointBuffers(int flags)
{
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
- CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
BufferSync(flags);
- CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- ProcessSyncRequests();
- CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..8a365b400c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
@@ -1052,7 +1050,7 @@ CheckPointPredicate(void)
SimpleLruTruncate(SerialSlruCtl, tailPage);
/*
- * Flush dirty SLRU pages to disk
+ * Write dirty SLRU pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
@@ -1061,7 +1059,7 @@ CheckPointPredicate(void)
* before deleting the file in which they sit, which would be completely
* pointless.
*/
- SimpleLruFlush(SerialSlruCtl, true);
+ SimpleLruWriteAll(SerialSlruCtl, true);
}
/*------------------------------------------------------------------------*/
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..1d635d596c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,12 +93,31 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
- {
+ [SYNC_HANDLER_MD] = {
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
+ },
+ /* pg_xact */
+ [SYNC_HANDLER_CLOG] = {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ [SYNC_HANDLER_COMMIT_TS] = {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ [SYNC_HANDLER_MULTIXACT_OFFSET] = {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ [SYNC_HANDLER_MULTIXACT_MEMBER] = {
+ .sync_syncfiletag = multixactmemberssyncfiletag
}
};
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..b39b43504d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,14 +136,15 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
TransactionId xid);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
+extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1afb345c3..88fc3ceae4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2404,6 +2404,7 @@ Syn
SyncOps
SyncRepConfigData
SyncRepStandbyData
+SyncRequestHandler
SyncRequestType
SysScanDesc
SyscacheCallbackFunction
--
2.20.1
Thomas Munro <thomas.munro@gmail.com> writes:
> Tom, do you have any thoughts on ShutdownCLOG() etc?

Hm, if we cannot reach that without first completing a shutdown checkpoint,
it does seem a little pointless.

It'd likely be a good idea to add a comment to CheckPointCLOG et al
explaining that we expect $what-exactly to fsync the data we are writing
before the checkpoint is considered done.

			regards, tom lane
On Fri, Sep 25, 2020 at 12:05 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> > Tom, do you have any thoughts on ShutdownCLOG() etc?
>
> Hm, if we cannot reach that without first completing a shutdown checkpoint,
> it does seem a little pointless.

Thanks for the sanity check.

> It'd likely be a good idea to add a comment to CheckPointCLOG et al
> explaining that we expect $what-exactly to fsync the data we are writing
> before the checkpoint is considered done.

Good point.  Done like this:

+	/*
+	 * Write dirty CLOG pages to disk.  This may result in sync requests
+	 * queued for later handling by ProcessSyncRequests(), as part of the
+	 * checkpoint.
+	 */
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
-	SimpleLruFlush(XactCtl, true);
+	SimpleLruWriteAll(XactCtl, true);
 	TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
Here's a new version. The final thing I'm contemplating before
pushing this is whether there may be hidden magical dependencies in
the order of operations in CheckPointGuts(), which I've changed
around. Andres, any comments?
Attachments:
v8-0001-Defer-flushing-of-SLRU-files.patch (text/x-patch)
From a722d71c4296e5fae778a783eb8140be7887eced Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@gmail.com>
Date: Fri, 25 Sep 2020 11:16:03 +1200
Subject: [PATCH v8] Defer flushing of SLRU files.
Previously, we called fsync() after writing out individual pg_xact,
pg_multixact and pg_commit_ts pages due to cache pressure, leading to
regular I/O stalls in user backends and recovery. Collapse requests for
the same file into a single system call as part of the next checkpoint,
as we already did for relation files, using the infrastructure developed
by commit 3eb77eba. This causes a significant improvement to recovery
performance.
Remove the Shutdown{CLOG,CommitTS,SUBTRANS,MultiXact}() functions,
because they were redundant after the shutdown checkpoint that
immediately precedes them.
Hoist ProcessSyncRequests() up into CheckPointGuts() to make it clearer
that it applies to all the SLRU mini-buffer-pools as well as the main
buffer pool. Also rearrange things so that data collected in
CheckpointStats includes SLRU activity.
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us> (ShutdownXXX removal part)
Tested-by: Jakub Wartak <Jakub.Wartak@tomtom.com>
Discussion: https://postgr.es/m/CA+hUKGLJ=84YT+NvhkEEDAuUtVHMfQ9i-N7k_o50JmQ6Rpj_OQ@mail.gmail.com
---
src/backend/access/transam/clog.c | 39 +++----
src/backend/access/transam/commit_ts.c | 36 +++---
src/backend/access/transam/multixact.c | 57 +++++----
src/backend/access/transam/slru.c | 154 +++++++++++++++++--------
src/backend/access/transam/subtrans.c | 25 +---
src/backend/access/transam/xlog.c | 18 ++-
src/backend/commands/async.c | 5 +-
src/backend/storage/buffer/bufmgr.c | 7 --
src/backend/storage/lmgr/predicate.c | 8 +-
src/backend/storage/sync/sync.c | 28 ++++-
src/include/access/clog.h | 3 +
src/include/access/commit_ts.h | 3 +
src/include/access/multixact.h | 4 +
src/include/access/slru.h | 14 ++-
src/include/storage/sync.h | 7 +-
src/tools/pgindent/typedefs.list | 1 +
16 files changed, 244 insertions(+), 165 deletions(-)
diff --git a/src/backend/access/transam/clog.c b/src/backend/access/transam/clog.c
index 9e352d2658..2ec6a924db 100644
--- a/src/backend/access/transam/clog.c
+++ b/src/backend/access/transam/clog.c
@@ -42,6 +42,7 @@
#include "pg_trace.h"
#include "pgstat.h"
#include "storage/proc.h"
+#include "storage/sync.h"
/*
* Defines for CLOG page sizes. A page is the same BLCKSZ as is used
@@ -691,7 +692,8 @@ CLOGShmemInit(void)
{
XactCtl->PagePrecedes = CLOGPagePrecedes;
SimpleLruInit(XactCtl, "Xact", CLOGShmemBuffers(), CLOG_LSNS_PER_PAGE,
- XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER);
+ XactSLRULock, "pg_xact", LWTRANCHE_XACT_BUFFER,
+ SYNC_HANDLER_CLOG);
}
/*
@@ -808,34 +810,18 @@ TrimCLOG(void)
LWLockRelease(XactSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCLOG(void)
-{
- /* Flush dirty CLOG pages to disk */
- TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(false);
- SimpleLruFlush(XactCtl, false);
-
- /*
- * fsync pg_xact to ensure that any files flushed previously are durably
- * on disk.
- */
- fsync_fname("pg_xact", true);
-
- TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(false);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCLOG(void)
{
- /* Flush dirty CLOG pages to disk */
+ /*
+ * Write dirty CLOG pages to disk. This may result in sync requests queued
+ * for later handling by ProcessSyncRequests(), as part of the checkpoint.
+ */
TRACE_POSTGRESQL_CLOG_CHECKPOINT_START(true);
- SimpleLruFlush(XactCtl, true);
+ SimpleLruWriteAll(XactCtl, true);
TRACE_POSTGRESQL_CLOG_CHECKPOINT_DONE(true);
}
@@ -1026,3 +1012,12 @@ clog_redo(XLogReaderState *record)
else
elog(PANIC, "clog_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync clog files.
+ */
+int
+clogsyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(XactCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/commit_ts.c b/src/backend/access/transam/commit_ts.c
index f6a7329ba3..cb8a968801 100644
--- a/src/backend/access/transam/commit_ts.c
+++ b/src/backend/access/transam/commit_ts.c
@@ -555,7 +555,8 @@ CommitTsShmemInit(void)
CommitTsCtl->PagePrecedes = CommitTsPagePrecedes;
SimpleLruInit(CommitTsCtl, "CommitTs", CommitTsShmemBuffers(), 0,
CommitTsSLRULock, "pg_commit_ts",
- LWTRANCHE_COMMITTS_BUFFER);
+ LWTRANCHE_COMMITTS_BUFFER,
+ SYNC_HANDLER_COMMIT_TS);
commitTsShared = ShmemInitStruct("CommitTs shared",
sizeof(CommitTimestampShared),
@@ -798,30 +799,18 @@ DeactivateCommitTs(void)
LWLockRelease(CommitTsSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownCommitTs(void)
-{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, false);
-
- /*
- * fsync pg_commit_ts to ensure that any files flushed previously are
- * durably on disk.
- */
- fsync_fname("pg_commit_ts", true);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
void
CheckPointCommitTs(void)
{
- /* Flush dirty CommitTs pages to disk */
- SimpleLruFlush(CommitTsCtl, true);
+ /*
+ * Write dirty CommitTs pages to disk. This may result in sync requests
+ * queued for later handling by ProcessSyncRequests(), as part of the
+ * checkpoint.
+ */
+ SimpleLruWriteAll(CommitTsCtl, true);
}
/*
@@ -1077,3 +1066,12 @@ commit_ts_redo(XLogReaderState *record)
else
elog(PANIC, "commit_ts_redo: unknown op code %u", info);
}
+
+/*
+ * Entrypoint for sync.c to sync commit_ts files.
+ */
+int
+committssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(CommitTsCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/multixact.c b/src/backend/access/transam/multixact.c
index b8bedca04a..a2ce617c8c 100644
--- a/src/backend/access/transam/multixact.c
+++ b/src/backend/access/transam/multixact.c
@@ -1831,11 +1831,13 @@ MultiXactShmemInit(void)
SimpleLruInit(MultiXactOffsetCtl,
"MultiXactOffset", NUM_MULTIXACTOFFSET_BUFFERS, 0,
MultiXactOffsetSLRULock, "pg_multixact/offsets",
- LWTRANCHE_MULTIXACTOFFSET_BUFFER);
+ LWTRANCHE_MULTIXACTOFFSET_BUFFER,
+ SYNC_HANDLER_MULTIXACT_OFFSET);
SimpleLruInit(MultiXactMemberCtl,
"MultiXactMember", NUM_MULTIXACTMEMBER_BUFFERS, 0,
MultiXactMemberSLRULock, "pg_multixact/members",
- LWTRANCHE_MULTIXACTMEMBER_BUFFER);
+ LWTRANCHE_MULTIXACTMEMBER_BUFFER,
+ SYNC_HANDLER_MULTIXACT_MEMBER);
/* Initialize our shared state struct */
MultiXactState = ShmemInitStruct("Shared MultiXact State",
@@ -2100,19 +2102,6 @@ TrimMultiXact(void)
SetMultiXactIdLimit(oldestMXact, oldestMXactDB, true);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownMultiXact(void)
-{
- /* Flush dirty MultiXact pages to disk */
- TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(false);
- SimpleLruFlush(MultiXactOffsetCtl, false);
- SimpleLruFlush(MultiXactMemberCtl, false);
- TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(false);
-}
-
/*
* Get the MultiXact data to save in a checkpoint record
*/
@@ -2143,9 +2132,13 @@ CheckPointMultiXact(void)
{
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_START(true);
- /* Flush dirty MultiXact pages to disk */
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ /*
+ * Write dirty MultiXact pages to disk. This may result in sync requests
+ * queued for later handling by ProcessSyncRequests(), as part of the
+ * checkpoint.
+ */
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
TRACE_POSTGRESQL_MULTIXACT_CHECKPOINT_DONE(true);
}
@@ -2728,14 +2721,10 @@ find_multixact_start(MultiXactId multi, MultiXactOffset *result)
entryno = MultiXactIdToOffsetEntry(multi);
/*
- * Flush out dirty data, so PhysicalPageExists can work correctly.
- * SimpleLruFlush() is a pretty big hammer for that. Alternatively we
- * could add an in-memory version of page exists, but find_multixact_start
- * is called infrequently, and it doesn't seem bad to flush buffers to
- * disk before truncation.
+ * Write out dirty data, so PhysicalPageExists can work correctly.
*/
- SimpleLruFlush(MultiXactOffsetCtl, true);
- SimpleLruFlush(MultiXactMemberCtl, true);
+ SimpleLruWriteAll(MultiXactOffsetCtl, true);
+ SimpleLruWriteAll(MultiXactMemberCtl, true);
if (!SimpleLruDoesPhysicalPageExist(MultiXactOffsetCtl, pageno))
return false;
@@ -3386,3 +3375,21 @@ pg_get_multixact_members(PG_FUNCTION_ARGS)
SRF_RETURN_DONE(funccxt);
}
+
+/*
+ * Entrypoint for sync.c to sync offsets files.
+ */
+int
+multixactoffsetssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactOffsetCtl, ftag, path);
+}
+
+/*
+ * Entrypoint for sync.c to sync members files.
+ */
+int
+multixactmemberssyncfiletag(const FileTag *ftag, char *path)
+{
+ return SlruSyncFileTag(MultiXactMemberCtl, ftag, path);
+}
diff --git a/src/backend/access/transam/slru.c b/src/backend/access/transam/slru.c
index fe7d759a8c..636b90db87 100644
--- a/src/backend/access/transam/slru.c
+++ b/src/backend/access/transam/slru.c
@@ -63,22 +63,33 @@
snprintf(path, MAXPGPATH, "%s/%04X", (ctl)->Dir, seg)
/*
- * During SimpleLruFlush(), we will usually not need to write/fsync more
- * than one or two physical files, but we may need to write several pages
- * per file. We can consolidate the I/O requests by leaving files open
- * until control returns to SimpleLruFlush(). This data structure remembers
- * which files are open.
+ * During SimpleLruWriteAll(), we will usually not need to write more than one
+ * or two physical files, but we may need to write several pages per file. We
+ * can consolidate the I/O requests by leaving files open until control returns
+ * to SimpleLruWriteAll(). This data structure remembers which files are open.
*/
-#define MAX_FLUSH_BUFFERS 16
+#define MAX_WRITEALL_BUFFERS 16
-typedef struct SlruFlushData
+typedef struct SlruWriteAllData
{
int num_files; /* # files actually open */
- int fd[MAX_FLUSH_BUFFERS]; /* their FD's */
- int segno[MAX_FLUSH_BUFFERS]; /* their log seg#s */
-} SlruFlushData;
+ int fd[MAX_WRITEALL_BUFFERS]; /* their FD's */
+ int segno[MAX_WRITEALL_BUFFERS]; /* their log seg#s */
+} SlruWriteAllData;
-typedef struct SlruFlushData *SlruFlush;
+typedef struct SlruWriteAllData *SlruWriteAll;
+
+/*
+ * Populate a file tag describing a segment file. We only use the segment
+ * number, since we can derive everything else we need by having separate
+ * sync handler functions for clog, multixact etc.
+ */
+#define INIT_SLRUFILETAG(a,xx_handler,xx_segno) \
+( \
+ memset(&(a), 0, sizeof(FileTag)), \
+ (a).handler = (xx_handler), \
+ (a).segno = (xx_segno) \
+)
/*
* Macro to mark a buffer slot "most recently used". Note multiple evaluation
@@ -125,10 +136,10 @@ static int slru_errno;
static void SimpleLruZeroLSNs(SlruCtl ctl, int slotno);
static void SimpleLruWaitIO(SlruCtl ctl, int slotno);
-static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata);
+static void SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata);
static bool SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno);
static bool SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno,
- SlruFlush fdata);
+ SlruWriteAll fdata);
static void SlruReportIOError(SlruCtl ctl, int pageno, TransactionId xid);
static int SlruSelectLRUPage(SlruCtl ctl, int pageno);
@@ -173,7 +184,8 @@ SimpleLruShmemSize(int nslots, int nlsns)
*/
void
SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id)
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler)
{
SlruShared shared;
bool found;
@@ -251,7 +263,7 @@ SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
* assume caller set PagePrecedes.
*/
ctl->shared = shared;
- ctl->do_fsync = true; /* default behavior */
+ ctl->sync_handler = sync_handler;
strlcpy(ctl->Dir, subdir, sizeof(ctl->Dir));
}
@@ -523,7 +535,7 @@ SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno, TransactionId xid)
* Control lock must be held at entry, and will be held at exit.
*/
static void
-SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
+SlruInternalWritePage(SlruCtl ctl, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int pageno = shared->page_number[slotno];
@@ -587,6 +599,10 @@ SlruInternalWritePage(SlruCtl ctl, int slotno, SlruFlush fdata)
/* Now it's okay to ereport if we failed */
if (!ok)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
+
+ /* If part of a checkpoint, count this as a buffer written. */
+ if (fdata)
+ CheckpointStats.ckpt_bufs_written++;
}
/*
@@ -730,13 +746,13 @@ SlruPhysicalReadPage(SlruCtl ctl, int pageno, int slotno)
*
* For now, assume it's not worth keeping a file pointer open across
* independent read/write operations. We do batch operations during
- * SimpleLruFlush, though.
+ * SimpleLruWriteAll, though.
*
* fdata is NULL for a standalone write, pointer to open-file info during
- * SimpleLruFlush.
+ * SimpleLruWriteAll.
*/
static bool
-SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
+SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruWriteAll fdata)
{
SlruShared shared = ctl->shared;
int segno = pageno / SLRU_PAGES_PER_SEGMENT;
@@ -791,7 +807,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
/*
- * During a Flush, we may already have the desired file open.
+ * During a WriteAll, we may already have the desired file open.
*/
if (fdata)
{
@@ -837,7 +853,7 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
if (fdata)
{
- if (fdata->num_files < MAX_FLUSH_BUFFERS)
+ if (fdata->num_files < MAX_WRITEALL_BUFFERS)
{
fdata->fd[fdata->num_files] = fd;
fdata->segno[fdata->num_files] = segno;
@@ -870,23 +886,31 @@ SlruPhysicalWritePage(SlruCtl ctl, int pageno, int slotno, SlruFlush fdata)
}
pgstat_report_wait_end();
- /*
- * If not part of Flush, need to fsync now. We assume this happens
- * infrequently enough that it's not a performance issue.
- */
- if (!fdata)
+ /* Queue up a sync request for the checkpointer. */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
- if (ctl->do_fsync && pg_fsync(fd) != 0)
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ if (!RegisterSyncRequest(&tag, SYNC_REQUEST, false))
{
+ /* No space to enqueue sync request. Do it synchronously. */
+ pgstat_report_wait_start(WAIT_EVENT_SLRU_SYNC);
+ if (pg_fsync(fd) != 0)
+ {
+ pgstat_report_wait_end();
+ slru_errcause = SLRU_FSYNC_FAILED;
+ slru_errno = errno;
+ CloseTransientFile(fd);
+ return false;
+ }
pgstat_report_wait_end();
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- CloseTransientFile(fd);
- return false;
}
- pgstat_report_wait_end();
+ }
+ /* Close file, unless part of flush request. */
+ if (!fdata)
+ {
if (CloseTransientFile(fd) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1122,13 +1146,16 @@ SlruSelectLRUPage(SlruCtl ctl, int pageno)
}
/*
- * Flush dirty pages to disk during checkpoint or database shutdown
+ * Write dirty pages to disk during checkpoint or database shutdown. Flushing
+ * is deferred until the next call to ProcessSyncRequests(), though we do fsync
+ * the containing directory here to make sure that newly created directory
+ * entries are on disk.
*/
void
-SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
+SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied)
{
SlruShared shared = ctl->shared;
- SlruFlushData fdata;
+ SlruWriteAllData fdata;
int slotno;
int pageno = 0;
int i;
@@ -1162,21 +1189,11 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
LWLockRelease(shared->ControlLock);
/*
- * Now fsync and close any files that were open
+ * Now close any files that were open
*/
ok = true;
for (i = 0; i < fdata.num_files; i++)
{
- pgstat_report_wait_start(WAIT_EVENT_SLRU_FLUSH_SYNC);
- if (ctl->do_fsync && pg_fsync(fdata.fd[i]) != 0)
- {
- slru_errcause = SLRU_FSYNC_FAILED;
- slru_errno = errno;
- pageno = fdata.segno[i] * SLRU_PAGES_PER_SEGMENT;
- ok = false;
- }
- pgstat_report_wait_end();
-
if (CloseTransientFile(fdata.fd[i]) != 0)
{
slru_errcause = SLRU_CLOSE_FAILED;
@@ -1189,7 +1206,7 @@ SimpleLruFlush(SlruCtl ctl, bool allow_redirtied)
SlruReportIOError(ctl, pageno, InvalidTransactionId);
/* Ensure that directory entries for new files are on disk. */
- if (ctl->do_fsync)
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
fsync_fname(ctl->Dir, true);
}
@@ -1350,6 +1367,19 @@ restart:
snprintf(path, MAXPGPATH, "%s/%04X", ctl->Dir, segno);
ereport(DEBUG2,
(errmsg("removing file \"%s\"", path)));
+
+ /*
+ * Tell the checkpointer to forget any sync requests, before we unlink the
+ * file.
+ */
+ if (ctl->sync_handler != SYNC_HANDLER_NONE)
+ {
+ FileTag tag;
+
+ INIT_SLRUFILETAG(tag, ctl->sync_handler, segno);
+ RegisterSyncRequest(&tag, SYNC_FORGET_REQUEST, true);
+ }
+
unlink(path);
LWLockRelease(shared->ControlLock);
@@ -1448,3 +1478,31 @@ SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data)
return retval;
}
+
+/*
+ * Individual SLRUs (clog, ...) have to provide a sync.c handler function so
+ * that they can provide the correct "SlruCtl" (otherwise we don't know how to
+ * build the path), but they just forward to this common implementation that
+ * performs the fsync.
+ */
+int
+SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path)
+{
+ int fd;
+ int save_errno;
+ int result;
+
+ SlruFileName(ctl, path, ftag->segno);
+
+ fd = OpenTransientFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ return -1;
+
+ result = pg_fsync(fd);
+ save_errno = errno;
+
+ CloseTransientFile(fd);
+
+ errno = save_errno;
+ return result;
+}
diff --git a/src/backend/access/transam/subtrans.c b/src/backend/access/transam/subtrans.c
index a50f60b99a..0111e867c7 100644
--- a/src/backend/access/transam/subtrans.c
+++ b/src/backend/access/transam/subtrans.c
@@ -193,9 +193,7 @@ SUBTRANSShmemInit(void)
SubTransCtl->PagePrecedes = SubTransPagePrecedes;
SimpleLruInit(SubTransCtl, "Subtrans", NUM_SUBTRANS_BUFFERS, 0,
SubtransSLRULock, "pg_subtrans",
- LWTRANCHE_SUBTRANS_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SubTransCtl->do_fsync = false;
+ LWTRANCHE_SUBTRANS_BUFFER, SYNC_HANDLER_NONE);
}
/*
@@ -278,23 +276,6 @@ StartupSUBTRANS(TransactionId oldestActiveXID)
LWLockRelease(SubtransSLRULock);
}
-/*
- * This must be called ONCE during postmaster or standalone-backend shutdown
- */
-void
-ShutdownSUBTRANS(void)
-{
- /*
- * Flush dirty SUBTRANS pages to disk
- *
- * This is not actually necessary from a correctness point of view. We do
- * it merely as a debugging aid.
- */
- TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(false);
- SimpleLruFlush(SubTransCtl, false);
- TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(false);
-}
-
/*
* Perform a checkpoint --- either during shutdown, or on-the-fly
*/
@@ -302,14 +283,14 @@ void
CheckPointSUBTRANS(void)
{
/*
- * Flush dirty SUBTRANS pages to disk
+ * Write dirty SUBTRANS pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely to improve the odds that writing of dirty pages is done by
* the checkpoint process and not by backends.
*/
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_START(true);
- SimpleLruFlush(SubTransCtl, true);
+ SimpleLruWriteAll(SubTransCtl, true);
TRACE_POSTGRESQL_SUBTRANS_CHECKPOINT_DONE(true);
}
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 61754312e2..ed17dd0d12 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8528,10 +8528,6 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();
}
/*
@@ -9176,16 +9172,28 @@ CreateEndOfRecoveryRecord(void)
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
+ /* Write out all dirty data in SLRUs and the main buffer pool */
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
+ CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
CheckPointCLOG();
CheckPointCommitTs();
CheckPointSUBTRANS();
CheckPointMultiXact();
CheckPointPredicate();
+ CheckPointBuffers(flags);
+
+ /* Perform all required fsyncs */
+ CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
+ ProcessSyncRequests();
+ CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
+ TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
+
+ /* Other activities performed at checkpoint time */
CheckPointRelationMap();
CheckPointReplicationSlots();
CheckPointSnapBuild();
CheckPointLogicalRewriteHeap();
- CheckPointBuffers(flags); /* performs all required fsyncs */
CheckPointReplicationOrigin();
/* We deliberately delay 2PC checkpointing as long as possible */
CheckPointTwoPhase(checkPointRedo);
diff --git a/src/backend/commands/async.c b/src/backend/commands/async.c
index cb341365df..8dbcace3f9 100644
--- a/src/backend/commands/async.c
+++ b/src/backend/commands/async.c
@@ -554,9 +554,8 @@ AsyncShmemInit(void)
*/
NotifyCtl->PagePrecedes = asyncQueuePagePrecedes;
SimpleLruInit(NotifyCtl, "Notify", NUM_NOTIFY_BUFFERS, 0,
- NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- NotifyCtl->do_fsync = false;
+ NotifySLRULock, "pg_notify", LWTRANCHE_NOTIFY_BUFFER,
+ SYNC_HANDLER_NONE);
if (!found)
{
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index a2a963bd5b..e549fa1d30 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2636,14 +2636,7 @@ PrintBufferLeakWarning(Buffer buffer)
void
CheckPointBuffers(int flags)
{
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_START(flags);
- CheckpointStats.ckpt_write_t = GetCurrentTimestamp();
BufferSync(flags);
- CheckpointStats.ckpt_sync_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_SYNC_START();
- ProcessSyncRequests();
- CheckpointStats.ckpt_sync_end_t = GetCurrentTimestamp();
- TRACE_POSTGRESQL_BUFFER_CHECKPOINT_DONE();
}
diff --git a/src/backend/storage/lmgr/predicate.c b/src/backend/storage/lmgr/predicate.c
index a2f8e7524b..8a365b400c 100644
--- a/src/backend/storage/lmgr/predicate.c
+++ b/src/backend/storage/lmgr/predicate.c
@@ -821,9 +821,7 @@ SerialInit(void)
SerialSlruCtl->PagePrecedes = SerialPagePrecedesLogically;
SimpleLruInit(SerialSlruCtl, "Serial",
NUM_SERIAL_BUFFERS, 0, SerialSLRULock, "pg_serial",
- LWTRANCHE_SERIAL_BUFFER);
- /* Override default assumption that writes should be fsync'd */
- SerialSlruCtl->do_fsync = false;
+ LWTRANCHE_SERIAL_BUFFER, SYNC_HANDLER_NONE);
/*
* Create or attach to the SerialControl structure.
@@ -1052,7 +1050,7 @@ CheckPointPredicate(void)
SimpleLruTruncate(SerialSlruCtl, tailPage);
/*
- * Flush dirty SLRU pages to disk
+ * Write dirty SLRU pages to disk
*
* This is not actually necessary from a correctness point of view. We do
* it merely as a debugging aid.
@@ -1061,7 +1059,7 @@ CheckPointPredicate(void)
* before deleting the file in which they sit, which would be completely
* pointless.
*/
- SimpleLruFlush(SerialSlruCtl, true);
+ SimpleLruWriteAll(SerialSlruCtl, true);
}
/*------------------------------------------------------------------------*/
diff --git a/src/backend/storage/sync/sync.c b/src/backend/storage/sync/sync.c
index 3ded2cdd71..1d635d596c 100644
--- a/src/backend/storage/sync/sync.c
+++ b/src/backend/storage/sync/sync.c
@@ -18,6 +18,9 @@
#include <fcntl.h>
#include <sys/file.h>
+#include "access/commit_ts.h"
+#include "access/clog.h"
+#include "access/multixact.h"
#include "access/xlog.h"
#include "access/xlogutils.h"
#include "commands/tablespace.h"
@@ -90,12 +93,31 @@ typedef struct SyncOps
const FileTag *candidate);
} SyncOps;
+/*
+ * These indexes must correspond to the values of the SyncRequestHandler enum.
+ */
static const SyncOps syncsw[] = {
/* magnetic disk */
- {
+ [SYNC_HANDLER_MD] = {
.sync_syncfiletag = mdsyncfiletag,
.sync_unlinkfiletag = mdunlinkfiletag,
.sync_filetagmatches = mdfiletagmatches
+ },
+ /* pg_xact */
+ [SYNC_HANDLER_CLOG] = {
+ .sync_syncfiletag = clogsyncfiletag
+ },
+ /* pg_commit_ts */
+ [SYNC_HANDLER_COMMIT_TS] = {
+ .sync_syncfiletag = committssyncfiletag
+ },
+ /* pg_multixact/offsets */
+ [SYNC_HANDLER_MULTIXACT_OFFSET] = {
+ .sync_syncfiletag = multixactoffsetssyncfiletag
+ },
+ /* pg_multixact/members */
+ [SYNC_HANDLER_MULTIXACT_MEMBER] = {
+ .sync_syncfiletag = multixactmemberssyncfiletag
}
};
@@ -505,8 +527,8 @@ RememberSyncRequest(const FileTag *ftag, SyncRequestType type)
(void *) ftag,
HASH_ENTER,
&found);
- /* if new entry, initialize it */
- if (!found)
+ /* if new entry, or was previously canceled, initialize it */
+ if (!found || entry->canceled)
{
entry->cycle_ctr = sync_cycle_ctr;
entry->canceled = false;
diff --git a/src/include/access/clog.h b/src/include/access/clog.h
index 2db8acb189..d97b9042dc 100644
--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@ -12,6 +12,7 @@
#define CLOG_H
#include "access/xlogreader.h"
+#include "storage/sync.h"
#include "lib/stringinfo.h"
/*
@@ -50,6 +51,8 @@ extern void CheckPointCLOG(void);
extern void ExtendCLOG(TransactionId newestXact);
extern void TruncateCLOG(TransactionId oldestXact, Oid oldestxid_datoid);
+extern int clogsyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define CLOG_ZEROPAGE 0x00
#define CLOG_TRUNCATE 0x10
diff --git a/src/include/access/commit_ts.h b/src/include/access/commit_ts.h
index 2740c02a84..27900ce430 100644
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@ -14,6 +14,7 @@
#include "access/xlog.h"
#include "datatype/timestamp.h"
#include "replication/origin.h"
+#include "storage/sync.h"
#include "utils/guc.h"
@@ -45,6 +46,8 @@ extern void SetCommitTsLimit(TransactionId oldestXact,
TransactionId newestXact);
extern void AdvanceOldestCommitTsXid(TransactionId oldestXact);
+extern int committssyncfiletag(const FileTag *ftag, char *path);
+
/* XLOG stuff */
#define COMMIT_TS_ZEROPAGE 0x00
#define COMMIT_TS_TRUNCATE 0x10
diff --git a/src/include/access/multixact.h b/src/include/access/multixact.h
index 6d729008c6..71d6e78063 100644
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@ -13,6 +13,7 @@
#include "access/xlogreader.h"
#include "lib/stringinfo.h"
+#include "storage/sync.h"
/*
@@ -116,6 +117,9 @@ extern bool MultiXactIdPrecedes(MultiXactId multi1, MultiXactId multi2);
extern bool MultiXactIdPrecedesOrEquals(MultiXactId multi1,
MultiXactId multi2);
+extern int multixactoffsetssyncfiletag(const FileTag *ftag, char *path);
+extern int multixactmemberssyncfiletag(const FileTag *ftag, char *path);
+
extern void AtEOXact_MultiXact(void);
extern void AtPrepare_MultiXact(void);
extern void PostPrepare_MultiXact(TransactionId xid);
diff --git a/src/include/access/slru.h b/src/include/access/slru.h
index 61fbc80ef0..b39b43504d 100644
--- a/src/include/access/slru.h
+++ b/src/include/access/slru.h
@@ -15,6 +15,7 @@
#include "access/xlogdefs.h"
#include "storage/lwlock.h"
+#include "storage/sync.h"
/*
@@ -111,10 +112,10 @@ typedef struct SlruCtlData
SlruShared shared;
/*
- * This flag tells whether to fsync writes (true for pg_xact and multixact
- * stuff, false for pg_subtrans and pg_notify).
+ * Which sync handler function to use when handing sync requests over to
+ * the checkpointer. SYNC_HANDLER_NONE to disable fsync (eg pg_notify).
*/
- bool do_fsync;
+ SyncRequestHandler sync_handler;
/*
* Decide which of two page numbers is "older" for truncation purposes. We
@@ -135,14 +136,15 @@ typedef SlruCtlData *SlruCtl;
extern Size SimpleLruShmemSize(int nslots, int nlsns);
extern void SimpleLruInit(SlruCtl ctl, const char *name, int nslots, int nlsns,
- LWLock *ctllock, const char *subdir, int tranche_id);
+ LWLock *ctllock, const char *subdir, int tranche_id,
+ SyncRequestHandler sync_handler);
extern int SimpleLruZeroPage(SlruCtl ctl, int pageno);
extern int SimpleLruReadPage(SlruCtl ctl, int pageno, bool write_ok,
TransactionId xid);
extern int SimpleLruReadPage_ReadOnly(SlruCtl ctl, int pageno,
TransactionId xid);
extern void SimpleLruWritePage(SlruCtl ctl, int slotno);
-extern void SimpleLruFlush(SlruCtl ctl, bool allow_redirtied);
+extern void SimpleLruWriteAll(SlruCtl ctl, bool allow_redirtied);
extern void SimpleLruTruncate(SlruCtl ctl, int cutoffPage);
extern bool SimpleLruDoesPhysicalPageExist(SlruCtl ctl, int pageno);
@@ -151,6 +153,8 @@ typedef bool (*SlruScanCallback) (SlruCtl ctl, char *filename, int segpage,
extern bool SlruScanDirectory(SlruCtl ctl, SlruScanCallback callback, void *data);
extern void SlruDeleteSegment(SlruCtl ctl, int segno);
+extern int SlruSyncFileTag(SlruCtl ctl, const FileTag *ftag, char *path);
+
/* SlruScanDirectory public callbacks */
extern bool SlruScanDirCbReportPresence(SlruCtl ctl, char *filename,
int segpage, void *data);
diff --git a/src/include/storage/sync.h b/src/include/storage/sync.h
index e16ab8e711..f32e412e75 100644
--- a/src/include/storage/sync.h
+++ b/src/include/storage/sync.h
@@ -34,7 +34,12 @@ typedef enum SyncRequestType
*/
typedef enum SyncRequestHandler
{
- SYNC_HANDLER_MD = 0 /* md smgr */
+ SYNC_HANDLER_MD = 0,
+ SYNC_HANDLER_CLOG,
+ SYNC_HANDLER_COMMIT_TS,
+ SYNC_HANDLER_MULTIXACT_OFFSET,
+ SYNC_HANDLER_MULTIXACT_MEMBER,
+ SYNC_HANDLER_NONE
} SyncRequestHandler;
/*
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index b1afb345c3..88fc3ceae4 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2404,6 +2404,7 @@ Syn
SyncOps
SyncRepConfigData
SyncRepStandbyData
+SyncRequestHandler
SyncRequestType
SysScanDesc
SyscacheCallbackFunction
--
2.20.1
On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new version.  The final thing I'm contemplating before
> pushing this is whether there may be hidden magical dependencies in
> the order of operations in CheckPointGuts(), which I've changed
> around.  Andres, any comments?

I nagged Andres off-list and he opined that it might be better to
reorder it a bit so that ProcessSyncRequests() comes after almost
everything else, so that if we ever teach more things to offload their
fsync work it'll be in the right order.  I reordered it like that; now
only CheckPointTwoPhase() comes later, based on the comment that
accompanies it.  In any case, we can always reconsider the ordering of
this function in later commits as required.  Pushed like that.
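
(For reference, the reordering described here should leave CheckPointGuts()
looking roughly like the sketch below.  This is reconstructed from the
description above and the v8 patch rather than quoted from the committed
tree; the exact placement of the non-SLRU checkpoint calls and the
trace/CheckpointStats bookkeeping, which is omitted here, is assumed.)

/*
 * Sketch of the reordered CheckPointGuts(): write phases first, then the
 * collected fsyncs, with only two-phase state deliberately delayed past
 * ProcessSyncRequests().
 */
static void
CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
{
	CheckPointRelationMap();
	CheckPointReplicationSlots();
	CheckPointSnapBuild();
	CheckPointLogicalRewriteHeap();
	CheckPointReplicationOrigin();

	/* Write out all dirty data in SLRUs and the main buffer pool */
	CheckPointCLOG();
	CheckPointCommitTs();
	CheckPointSUBTRANS();
	CheckPointMultiXact();
	CheckPointPredicate();
	CheckPointBuffers(flags);

	/* Perform all queued-up fsyncs, near the end as discussed above */
	ProcessSyncRequests();

	/* We deliberately delay 2PC checkpointing as long as possible */
	CheckPointTwoPhase(checkPointRedo);
}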
On 9/25/20 9:09 AM, Thomas Munro wrote:
> On Fri, Sep 25, 2020 at 12:53 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Here's a new version.  The final thing I'm contemplating before
> > pushing this is whether there may be hidden magical dependencies in
> > the order of operations in CheckPointGuts(), which I've changed
> > around.  Andres, any comments?
>
> I nagged Andres off-list and he opined that it might be better to
> reorder it a bit so that ProcessSyncRequests() comes after almost
> everything else, so that if we ever teach more things to offload their
> fsync work it'll be in the right order.  I reordered it like that; now
> only CheckPointTwoPhase() comes later, based on the comment that
> accompanies it.  In any case, we can always reconsider the ordering of
> this function in later commits as required.  Pushed like that.
Seems this commit left behind a couple of unnecessary prototypes in a bunch
of header files. In particular, it removed these functions:
- ShutdownCLOG();
- ShutdownCommitTs();
- ShutdownSUBTRANS();
- ShutdownMultiXact();
but we still have
$ git grep ShutdownCLOG
src/include/access/clog.h:extern void ShutdownCLOG(void);
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Jan 4, 2021 at 3:35 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
> Seems this commit left behind a couple of unnecessary prototypes in a bunch
> of header files. In particular, it removed these functions:
>
> - ShutdownCLOG();
> - ShutdownCommitTs();
> - ShutdownSUBTRANS();
> - ShutdownMultiXact();

Thanks.  Fixed.
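
(The fixup itself isn't shown in the thread.  Presumably the cleanup just
drops the stale declarations from the headers, along these lines; only the
clog.h line is confirmed by the grep output above, the other paths and
prototypes are assumed for illustration:)

--- a/src/include/access/clog.h
+++ b/src/include/access/clog.h
@@
-extern void ShutdownCLOG(void);
--- a/src/include/access/commit_ts.h
+++ b/src/include/access/commit_ts.h
@@
-extern void ShutdownCommitTs(void);
--- a/src/include/access/subtrans.h
+++ b/src/include/access/subtrans.h
@@
-extern void ShutdownSUBTRANS(void);
--- a/src/include/access/multixact.h
+++ b/src/include/access/multixact.h
@@
-extern void ShutdownMultiXact(void);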