[PoC] Non-volatile WAL buffer
Dear hackers,
I propose "non-volatile WAL buffer," a proof-of-concept new feature. It
makes WAL records durable without writing them out to WAL segment files,
by placing the WAL buffer on persistent memory (PMEM) instead of DRAM. It
improves database performance by reducing the number of copies of WAL data
and by shortening the time write transactions take.
I attach the first patchset, which applies to PostgreSQL 12.0 (refs/
tags/REL_12_0). Please see README.nvwal (added by patch 0003) for how to
use the new feature.
PMEM [1] is fast, non-volatile, byte-addressable memory installed into
DIMM slots. Such products are already available. For example, an
NVDIMM-N is a type of PMEM module that contains both DRAM and NAND flash.
It can be accessed like regular DRAM, but on power loss it saves its
contents into the flash area, and on power restore it performs the
reverse, copying the contents back into DRAM. PMEM is also already
supported by major operating systems such as Linux and Windows, and by
new open-source libraries such as the Persistent Memory Development Kit (PMDK) [2].
Furthermore, several DBMSes have started to support PMEM.
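As a rough sketch of the PMDK primitives this patchset builds on (not runnable without libpmem and, ideally, a DAX-mounted PMEM filesystem; the path is a hypothetical example):

```c
#include <stdio.h>
#include <libpmem.h>			/* PMDK; link with -lpmem */

int
main(void)
{
	size_t		mapped_len;
	int			is_pmem;

	/* Create and map a file, getting a load/store-accessible address. */
	char	   *buf = pmem_map_file("/mnt/pmem/example", 4096,
									PMEM_FILE_CREATE, 0600,
									&mapped_len, &is_pmem);

	if (buf == NULL)
	{
		perror("pmem_map_file");
		return 1;
	}

	/* Ordinary stores... */
	buf[0] = 'x';

	/*
	 * ...made durable with a CPU cache flush plus drain, instead of
	 * write() + fsync() through the filesystem and block layer.
	 */
	if (is_pmem)
		pmem_persist(buf, 1);
	else
		pmem_msync(buf, 1);		/* fallback when the file is not on PMEM */

	pmem_unmap(buf, mapped_len);
	return 0;
}
```

The patchset wraps exactly these calls (pmem_map_file, pmem_persist, and friends) behind its nv_* helpers in nv_xlog_buffer.h.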
It's time for PostgreSQL. PMEM is faster than a solid-state disk and can
naively be used as block storage. However, we cannot gain much
performance that way because PMEM is so fast that the overhead of the
traditional software stack (user buffers, filesystems, and the block
layer) is no longer negligible. Non-volatile WAL buffer is an effort to
make PostgreSQL PMEM-aware, that is, to access PMEM directly as RAM,
bypassing that overhead and achieving the maximum possible benefit. I
believe WAL is one of the most important modules to redesign for PMEM
because it has assumed slow storage such as HDDs and SSDs, and PMEM is not.
This work is inspired by "Non-volatile Memory Logging," presented at PGCon
2016 [3], and aims to gain more benefit from PMEM than my and Yoshimi's
previous work did [4][5]. I have submitted a talk proposal for this year's
PGCon, and have measured and analyzed the performance of my PostgreSQL
with non-volatile WAL buffer, comparing it with the original that uses
PMEM as a "faster-than-SSD storage." I will present the results if the
proposal is accepted.
Best regards,
Takashi Menjo
[1]: Persistent Memory (SNIA)
     https://www.snia.org/PM
[2]: Persistent Memory Development Kit (pmem.io)
     https://pmem.io/pmdk/
[3]: Non-volatile Memory Logging (PGCon 2016)
     https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[4]: Introducing PMDK into PostgreSQL (PGCon 2018)
     https://www.pgcon.org/2018/schedule/events/1154.en.html
[5]: Applying PMDK to WAL operations for persistent memory (pgsql-hackers)
     /messages/by-id/C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
Attachments:
0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From 02896517f42d60e8f436ec5d0ab1a55b0ce1a3f9 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Fri, 24 Jan 2020 13:16:26 +0900
Subject: [PATCH 1/3] Support GUCs for external WAL buffer
To implement non-volatile WAL buffer, we add two new GUCs, nvwal_path
and nvwal_size. Postgres now maps the file at that path into memory and
uses it as the WAL buffer. Note that the buffer is still volatile for now.
---
configure | 99 +++++++++++
configure.in | 19 ++
src/backend/access/transam/Makefile | 2 +-
src/backend/access/transam/nv_xlog_buffer.c | 95 ++++++++++
src/backend/access/transam/xlog.c | 164 ++++++++++++++++--
src/backend/utils/misc/guc.c | 23 ++-
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 95 +++++++++-
src/include/access/nv_xlog_buffer.h | 71 ++++++++
src/include/access/xlog.h | 2 +
src/include/pg_config.h.in | 6 +
src/include/utils/guc.h | 4 +
12 files changed, 560 insertions(+), 22 deletions(-)
create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
create mode 100644 src/include/access/nv_xlog_buffer.h
diff --git a/configure b/configure
index 54c852aca5..4674419094 100755
--- a/configure
+++ b/configure
@@ -864,6 +864,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_nvwal
with_gnu_ld
enable_largefile
enable_float4_byval
@@ -1570,6 +1571,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-nvwal use non-volatile WAL buffer (NVWAL)
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8306,6 +8308,40 @@ fi
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+ withval=$with_nvwal;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
#
# Elf
#
@@ -12694,6 +12730,57 @@ fi
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_pmem_pmem_map_file=yes
+else
+ ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+ LIBS="-lpmem $LIBS"
+
+else
+ as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
##
## Header files
@@ -13373,6 +13460,18 @@ fi
done
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
fi
if test "$PORTNAME" = "win32" ; then
diff --git a/configure.in b/configure.in
index 6942f81d1e..d2062d020a 100644
--- a/configure.in
+++ b/configure.in
@@ -964,6 +964,14 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+ [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
#
# Elf
#
@@ -1287,6 +1295,12 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ AC_CHECK_LIB(pmem, pmem_map_file, [],
+ [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
##
## Header files
@@ -1467,6 +1481,11 @@ elif test "$with_uuid" = ossp ; then
[AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
if test "$PORTNAME" = "win32" ; then
AC_CHECK_HEADERS(crtdefs.h)
fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47269..addeae9477 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
xact.o xlog.o xlogarchive.o xlogfuncs.o \
- xloginsert.o xlogreader.o xlogutils.o
+ xloginsert.o xlogreader.o xlogutils.o nv_xlog_buffer.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ * PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns the mapped address on success; PANICs and never returns otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+ void *addr;
+ size_t map_len = 0;
+ int is_pmem = 0;
+
+ Assert(fname != NULL);
+ Assert(fsize > 0);
+
+ if (IsBootstrapProcessingMode())
+ {
+ /*
+ * Create and map a new file if we are in bootstrap mode (typically
+ * executed by initdb).
+ */
+ addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+ pg_file_create_mode, &map_len, &is_pmem);
+ }
+ else
+ {
+ /*
+ * Map an existing file. The second argument (len) should be zero,
+ * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+ * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+ */
+ addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+ }
+
+ if (addr == NULL)
+ elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+ if (map_len != fsize)
+ elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+ "expected %zu; actual %zu",
+ fname, fsize, map_len);
+
+ if (!is_pmem)
+ elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+ fname);
+
+ /*
+ * Assert page boundary alignment (8KiB by default). It should hold because
+ * PMDK aligns the mapping to a hugepage boundary (2MiB or 1GiB on x64).
+ */
+ Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+ elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+ fname, addr, (char *) addr + map_len);
+ return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+ Assert(addr != NULL);
+
+ if (pmem_unmap(addr, fsize) < 0)
+ {
+ elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+ return;
+ }
+
+ elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 77ad765989..eae0c01e3c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
#include "access/xloginsert.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
@@ -848,6 +849,12 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
+/* For non-volatile WAL buffer (NVWAL) */
+char *NvwalPath = NULL; /* a GUC parameter */
+int NvwalSizeMB = 1024; /* a direct GUC parameter */
+static Size NvwalSize = 0; /* an indirect GUC parameter */
+static bool NvwalAvail = false;
+
/* For WALInsertLockAcquire/Release functions */
static int MyLockNo = 0;
static bool holdingAllLocks = false;
@@ -4906,6 +4913,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+ Assert(!NvwalAvail);
+
+ if (**newval != '\0')
+ {
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ GUC_check_errmsg("nvwal_path is invalid parameter without NVWAL");
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+ /* true if not empty; false if empty */
+ NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the bounds only and DOES NOT check whether the size is a
+ * multiple of wal_segment_size, because the segment size (probably stored
+ * in the control file) has not been set properly yet at this point.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+ Size buf_size;
+ int64 npages;
+
+ Assert(*newval > 0);
+
+ buf_size = (Size) (*newval) * 1024 * 1024;
+ npages = (int64) buf_size / XLOG_BLCKSZ;
+ Assert(npages > 0);
+
+ if (npages > INT_MAX)
+ {
+ /* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+ "the number of WAL pages is too large; "
+ "buf_size %zu; XLOG_BLCKSZ %d",
+ *newval, buf_size, (int) XLOG_BLCKSZ);
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+ NvwalSize = (Size) newval * 1024 * 1024;
+}
+
/*
* Read the control file, set respective GUCs.
*
@@ -4934,13 +5011,49 @@ XLOGShmemSize(void)
{
Size size;
+ /*
+ * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+ * Instead, we set it to a value based on the size of the buffer file.
+ * This should be done here because of the xlblocks array calculation.
+ */
+ if (NvwalAvail)
+ {
+ char buf[32];
+ int64 npages;
+
+ Assert(NvwalSizeMB > 0);
+ Assert(NvwalSize > 0);
+ Assert(wal_segment_size > 0);
+ Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+ /*
+ * At last, we can check whether the size of the non-volatile WAL buffer
+ * (nvwal_size) is a multiple of the WAL segment size.
+ *
+ * Note that NvwalSize has already been calculated in assign_nvwal_size.
+ */
+ if (NvwalSize % wal_segment_size != 0)
+ {
+ elog(PANIC,
+ "invalid value for nvwal_size (%dMB): "
+ "it must be a multiple of the WAL segment size; "
+ "NvwalSize %zu; wal_segment_size %d",
+ NvwalSizeMB, NvwalSize, wal_segment_size);
+ }
+
+ npages = (int64) NvwalSize / XLOG_BLCKSZ;
+ Assert(npages > 0 && npages <= INT_MAX);
+
+ snprintf(buf, sizeof(buf), "%d", (int) npages);
+ SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+ }
/*
* If the value of wal_buffers is -1, use the preferred auto-tune value.
* This isn't an amazingly clean place to do this, but we must wait till
* NBuffers has received its final value, and must do it before using the
* value of XLOGbuffers to do anything important.
*/
- if (XLOGbuffers == -1)
+ else if (XLOGbuffers == -1)
{
char buf[32];
@@ -4956,10 +5069,13 @@ XLOGShmemSize(void)
size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
- /* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
- /* and the buffers themselves */
- size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ if (!NvwalAvail)
+ {
+ /* extra alignment padding for XLOG I/O buffers */
+ size = add_size(size, XLOG_BLCKSZ);
+ /* and the buffers themselves */
+ size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ }
/*
* Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5056,13 +5172,32 @@ XLOGShmemInit(void)
}
/*
- * Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+ * Open and memory-map a file for non-volatile XLOG buffer. The PMDK will
+ * align the start of the buffer to 2-MiB boundary if the size of the
+ * buffer is larger than or equal to 4 MiB.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
- XLogCtl->pages = allocptr;
- memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ if (NvwalAvail)
+ {
+ /* Logging and error-handling should be done in the function */
+ XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+ /*
+ * Do not memset non-volatile XLOG buffer (XLogCtl->pages) here
+ * because it would contain records for recovery. We should do so in
+ * checkpoint after the recovery completes successfully.
+ */
+ }
+ else
+ {
+ /*
+ * Align the start of the page buffers to a full xlog block size
+ * boundary. This simplifies some calculations in XLOG insertion. It
+ * is also required for O_DIRECT.
+ */
+ allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ XLogCtl->pages = allocptr;
+ memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ }
/*
* Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8343,6 +8478,13 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
+
+ /*
+ * If we use non-volatile XLOG buffer, unmap it.
+ */
+ if (NvwalAvail)
+ UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f0ed326a1b..39d087d2d1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2606,7 +2606,7 @@ static struct config_int ConfigureNamesInt[] =
GUC_UNIT_XBLOCKS
},
&XLOGbuffers,
- -1, -1, (INT_MAX / XLOG_BLCKSZ),
+ -1, -1, INT_MAX,
check_wal_buffers, NULL, NULL
},
@@ -3194,6 +3194,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, assign_tcp_user_timeout, show_tcp_user_timeout
},
+ {
+ {"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+ NULL,
+ GUC_UNIT_MB
+ },
+ &NvwalSizeMB,
+ 1024, 1, INT_MAX,
+ check_nvwal_size, assign_nvwal_size, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4199,6 +4210,16 @@ static struct config_string ConfigureNamesString[] =
NULL, NULL, NULL
},
+ {
+ {"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+ NULL
+ },
+ &NvwalPath,
+ "",
+ check_nvwal_path, assign_nvwal_path, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b61e66932c..f77a4a7d0e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -223,6 +223,8 @@
#checkpoint_timeout = 5min # range 30s-1d
#max_wal_size = 1GB
#min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index acf610808e..f08da4da9b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,7 +144,10 @@ static bool show_setting = false;
static bool data_checksums = false;
static char *xlog_dir = NULL;
static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
static int wal_segment_size_mb;
+static int nvwal_size_mb;
/* internal vars */
@@ -1115,14 +1118,78 @@ setup_config(void)
conflines = replace_token(conflines, "#port = 5432", repltok);
#endif
- /* set default max_wal_size and min_wal_size */
- snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
- pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
- conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
-
- snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
- pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
- conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ if (nvwal_path != NULL)
+ {
+ int nr_segs;
+
+ if (str_nvwal_size_mb == NULL)
+ nvwal_size_mb = 1024;
+ else
+ {
+ char *endptr;
+
+ /* check that the argument is a number */
+ nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+ /* verify that the size of non-volatile WAL buffer is valid */
+ if (endptr == str_nvwal_size_mb || *endptr != '\0')
+ {
+ pg_log_error("argument of --nvwal-size must be a number; "
+ "str_nvwal_size_mb '%s'",
+ str_nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb <= 0)
+ {
+ pg_log_error("argument of --nvwal-size must be a positive number; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb % wal_segment_size_mb != 0)
+ {
+ pg_log_error("argument of --nvwal-size must be a multiple of the WAL segment size; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+ exit(1);
+ }
+ }
+
+ /*
+ * XXX We set {min_,max_,nv}wal_size to the same value. Note that
+ * postgres might bootstrap and run even if the three settings do not
+ * have the same value, but that has not been tested yet.
+ */
+ nr_segs = nvwal_size_mb / wal_segment_size_mb;
+
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+ nvwal_path);
+ conflines = replace_token(conflines,
+ "#nvwal_path = '/path/to/nvwal'", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+ }
+ else
+ {
+ /* set default max_wal_size and min_wal_size */
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ }
snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
escape_quotes(lc_messages));
@@ -2373,6 +2440,8 @@ usage(const char *progname)
printf(_(" -W, --pwprompt prompt for a password for the new superuser\n"));
printf(_(" -X, --waldir=WALDIR location for the write-ahead log directory\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
+ printf(_(" -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)\n"));
+ printf(_(" -Q, --nvwal-size=SIZE size of NVWAL, in megabytes\n"));
printf(_("\nLess commonly used options:\n"));
printf(_(" -d, --debug generate lots of debugging output\n"));
printf(_(" -k, --data-checksums use data page checksums\n"));
@@ -3051,6 +3120,8 @@ main(int argc, char *argv[])
{"sync-only", no_argument, NULL, 'S'},
{"waldir", required_argument, NULL, 'X'},
{"wal-segsize", required_argument, NULL, 12},
+ {"nvwal-path", required_argument, NULL, 'P'},
+ {"nvwal-size", required_argument, NULL, 'Q'},
{"data-checksums", no_argument, NULL, 'k'},
{"allow-group-access", no_argument, NULL, 'g'},
{NULL, 0, NULL, 0}
@@ -3094,7 +3165,7 @@ main(int argc, char *argv[])
/* process command-line options */
- while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+ while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
{
switch (c)
{
@@ -3188,6 +3259,12 @@ main(int argc, char *argv[])
case 12:
str_wal_segment_size_mb = pg_strdup(optarg);
break;
+ case 'P':
+ nvwal_path = pg_strdup(optarg);
+ break;
+ case 'Q':
+ str_nvwal_size_mb = pg_strdup(optarg);
+ break;
case 'g':
SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+#define nv_memset_persist pmem_memset_persist
+#define nv_memcpy_nodrain pmem_memcpy_nodrain
+#define nv_flush pmem_flush
+#define nv_drain pmem_drain
+#define nv_persist pmem_persist
+
+#else
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+ return NULL;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+ return;
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+ return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+ size_t len)
+{
+ return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+ return;
+}
+
+static inline void
+nv_drain(void)
+{
+ return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+ return;
+}
+
+#endif /* USE_NVWAL */
+#endif /* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..bc09fa104c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -129,6 +129,8 @@ extern int recoveryTargetAction;
extern int recovery_min_apply_delay;
extern char *PrimaryConnInfo;
extern char *PrimarySlotName;
+extern char *NvwalPath;
+extern int NvwalSizeMB;
/* indirectly set via GUC system */
extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 512213aa32..bd2b434d93 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -356,6 +356,9 @@
/* Define to 1 if you have the `pam' library (-lpam). */
#undef HAVE_LIBPAM
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
/* Define if you have a function readline library */
#undef HAVE_LIBREADLINE
@@ -932,6 +935,9 @@
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
/* Define to build with OpenSSL support. (--with-openssl) */
#undef USE_OPENSSL
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index a93ed77c9c..3bd4bbb872 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -432,6 +432,10 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
#endif /* GUC_H */
--
2.20.1
0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 6d75e271b7475cc853b13ef54d13ba1c0b2fab1d Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Fri, 24 Jan 2020 13:16:27 +0900
Subject: [PATCH 2/3] Non-volatile WAL buffer
Now the external WAL buffer becomes non-volatile.
Bumps PG_CONTROL_VERSION.
---
src/backend/access/transam/xlog.c | 975 +++++++++++++++++++++---
src/backend/replication/walsender.c | 50 ++
src/bin/pg_controldata/pg_controldata.c | 3 +
src/include/access/xlog.h | 6 +
src/include/catalog/pg_control.h | 17 +-
5 files changed, 948 insertions(+), 103 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eae0c01e3c..ba89d3c158 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -643,6 +643,13 @@ typedef struct XLogCtlData
TimeLineID ThisTimeLineID;
TimeLineID PrevTimeLineID;
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * All the records up to this LSN are persistent in NVWAL.
+ */
+ XLogRecPtr persistentUpTo;
+
/*
* SharedRecoveryInProgress indicates if we're still in crash or archive
* recovery. Protected by info_lck.
@@ -766,11 +773,12 @@ typedef enum
XLOG_FROM_ANY = 0, /* request to read WAL from any source */
XLOG_FROM_ARCHIVE, /* restored using restore_command */
XLOG_FROM_PG_WAL, /* existing file in pg_wal */
+ XLOG_FROM_NVWAL, /* non-volatile WAL buffer */
XLOG_FROM_STREAM /* streamed from master */
} XLogSource;
/* human-readable names for XLogSources, for debugging output */
-static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream"};
/*
* openLogFile is -1 or a kernel FD for an open log file segment.
@@ -898,6 +906,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
@@ -1177,6 +1186,43 @@ XLogInsertRecord(XLogRecData *rdata,
}
}
+ /*
+ * Request a checkpoint here if non-volatile WAL buffer is used and we
+ * have consumed too much WAL since the last checkpoint.
+ *
+ * We first screen under the condition (1) OR (2) below:
+ *
+ * (1) The record was the first one in a certain segment.
+ * (2) The record was inserted across segments.
+ *
+ * We then check the segment number which the record was inserted into.
+ */
+ if (NvwalAvail && inserted &&
+ (StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+ StartPos / wal_segment_size < EndPos / wal_segment_size))
+ {
+ XLogSegNo end_segno;
+
+ XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+ /*
+ * NOTE: We do not signal walsender here because the inserted records
+ * have not been drained from the NVWAL buffer yet.
+ *
+ * NOTE: We do not signal walarchiver here because the inserted record
+ * have not flushed to a segment file. So we don't need to update
+ * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+ */
+
+ /* Two-step checking for speed (see also XLogWrite) */
+ if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(end_segno))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
@@ -2100,6 +2146,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
XLogRecPtr NewPageBeginPtr;
XLogPageHeader NewPage;
int npages = 0;
+ bool is_firstpage;
+
+ if (NvwalAvail)
+ elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo,
+ (uint32) (upto >> 32),
+ (uint32) upto,
+ opportunistic ? "true" : "false");
LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
@@ -2161,7 +2216,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
{
/* Have to write it ourselves */
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
- WriteRqst.Write = OldPageRqstPtr;
+
+ if (NvwalAvail)
+ {
+ /*
+ * If we use the non-volatile WAL buffer, writing the
+ * buffer pages out to segment files is a special but
+ * expected case, and for simplicity it is done segment
+ * by segment.
+ */
+ XLogRecPtr OldSegEndPtr;
+
+ OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+ Assert(OldSegEndPtr % wal_segment_size == 0);
+
+ WriteRqst.Write = OldSegEndPtr;
+ }
+ else
+ WriteRqst.Write = OldPageRqstPtr;
+
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, false);
LWLockRelease(WALWriteLock);
@@ -2188,7 +2261,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* Be sure to re-zero the buffer so that bytes beyond what we've
* written will look like zeroes and not valid XLOG records...
*/
- MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+ if (NvwalAvail)
+ {
+ /*
+ * We do not combine MemSet() and pmem_persist() because
+ * pmem_persist() may fall back to a slow, strongly-ordered cache
+ * flush instruction if the fast weakly-ordered one is not supported.
+ * Instead, we first zero-fill the buffer with
+ * pmem_memset_persist(), which can leverage fast non-temporal store
+ * instructions, and make the header persistent later.
+ */
+ nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+ }
+ else
+ MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
/*
* Fill the new page's header
@@ -2220,7 +2306,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
/*
* If first page of an XLOG segment file, make it a long header.
*/
- if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+ is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+ if (is_firstpage)
{
XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
@@ -2235,7 +2322,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* before the xlblocks update. GetXLogBuffer() reads xlblocks without
* holding a lock.
*/
- pg_write_barrier();
+ if (NvwalAvail)
+ {
+ /* Make the header persistent on PMEM */
+ nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+ }
+ else
+ pg_write_barrier();
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
@@ -2245,6 +2338,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
}
LWLockRelease(WALBufMappingLock);
+ if (NvwalAvail)
+ elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo,
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo);
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG && npages > 0)
{
@@ -2616,6 +2716,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
LogwrtResult.Flush = LogwrtResult.Write;
}
+ /*
+ * Update discardedUpTo if NVWAL is used. The new value must not fall
+ * behind the old one.
+ */
+ if (NvwalAvail)
+ {
+ Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (ControlFile->discardedUpTo < LogwrtResult.Write)
+ {
+ ControlFile->discardedUpTo = LogwrtResult.Write;
+ UpdateControlFile();
+ }
+ LWLockRelease(ControlFileLock);
+ }
+
/*
* Update shared-memory status
*
@@ -2820,6 +2937,123 @@ XLogFlush(XLogRecPtr record)
return;
}
+ if (NvwalAvail)
+ {
+ XLogRecPtr FromPos;
+
+ /*
+ * No page on the NVWAL is to be flushed to segment files. Instead,
+ * we wait until all the insertions preceding this one complete, and
+ * below we wait for all the records to become persistent on the NVWAL.
+ */
+ record = WaitXLogInsertionsToFinish(record);
+
+ /*
+ * Check if another backend has already done what I am doing.
+ *
+ * We can compare against XLogCtl->persistentUpTo without holding
+ * the XLogCtl->info_lck spinlock because persistentUpTo is
+ * monotonically increasing and can be loaded atomically on every
+ * NVWAL-supported platform (currently x64 only).
+ */
+ FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+ if (record <= FromPos)
+ return;
+
+ /*
+ * In a very rare case, we have wrapped around the whole NVWAL. We
+ * need not care about old pages here because they have already been
+ * evicted to segment files at record insertion.
+ *
+ * In such a case, we flush the whole NVWAL. We also log a warning
+ * because this can be a time-consuming operation.
+ *
+ * TODO: Advance XLogCtl->persistentUpTo at the end of XLogWrite so
+ * that the following first if-block can be removed.
+ */
+ if (record - FromPos > NvwalSize)
+ {
+ elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+ (uint32) (FromPos >> 32), (uint32) FromPos,
+ (uint32) (record >> 32), (uint32) record);
+
+ nv_flush(XLogCtl->pages, NvwalSize);
+ }
+ else
+ {
+ char *frompos;
+ char *uptopos;
+ size_t fromoff;
+ size_t uptooff;
+
+ /*
+ * Flush each record that is probably not flushed yet.
+ *
+ * We say "probably" for two reasons. First, a record copied with a
+ * non-temporal store instruction is already "flushed", but we cannot
+ * distinguish such records; nv_flush is harmless to consistency in
+ * that case.
+ *
+ * Second, the target record might have already been evicted to a
+ * segment file by now. nv_flush is harmless to consistency in this
+ * case as well.
+ */
+ uptooff = record % NvwalSize;
+ uptopos = XLogCtl->pages + uptooff;
+ fromoff = FromPos % NvwalSize;
+ frompos = XLogCtl->pages + fromoff;
+
+ /* Handles rotation */
+ if (uptopos <= frompos)
+ {
+ nv_flush(frompos, NvwalSize - fromoff);
+ fromoff = 0;
+ frompos = XLogCtl->pages;
+ }
+
+ nv_flush(frompos, uptooff - fromoff);
+ }
+
+ /*
+ * To guarantee durability ("D" of ACID), we should satisfy the
+ * following two for each transaction X:
+ *
+ * (1) All the WAL records inserted by X, including the commit record
+ * of X, should persist on NVWAL before the server commits X.
+ *
+ * (2) All the WAL records inserted by transactions other than X
+ *     that have a lower LSN than the commit record just inserted
+ *     by X should persist on NVWAL before the server commits X.
+ *
+ * (1) can be satisfied by a store barrier after the commit record of
+ * X is flushed, because each WAL record of X is already flushed at
+ * the end of its insertion. (2) can be satisfied by waiting for all
+ * record insertions with a lower LSN than the commit record just
+ * inserted by X, plus the same store barrier.
+ *
+ * Now is the time: issue a store barrier.
+ */
+ nv_drain();
+
+ /*
+ * Remember where the last persistent record is. A new value should
+ * not fall behind the old one.
+ */
+ SpinLockAcquire(&XLogCtl->info_lck);
+ if (XLogCtl->persistentUpTo < record)
+ XLogCtl->persistentUpTo = record;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ /*
+ * The records up to the returned "record" are now persistent on
+ * NVWAL. Now signal walsenders.
+ */
+ WalSndWakeupRequest();
+ WalSndWakeupProcessRequests();
+
+ return;
+ }
+
/* Quick exit if already known flushed */
if (record <= LogwrtResult.Flush)
return;
@@ -3003,6 +3237,13 @@ XLogBackgroundFlush(void)
if (RecoveryInProgress())
return false;
+ /*
+ * Quick exit if the NVWAL buffer is used and archiving is not active. In
+ * this case, we need no WAL segment files in the pg_wal directory.
+ */
+ if (NvwalAvail && !XLogArchivingActive())
+ return false;
+
/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
@@ -3021,6 +3262,18 @@ XLogBackgroundFlush(void)
flexible = false; /* ensure it all gets written */
}
+ /*
+ * If NVWAL is used, back off to the last completed segment boundary
+ * so that buffer pages are written to files segment by segment. We do
+ * so only here, after XLogCtl->asyncXactLSN has been loaded, because
+ * it must be taken into account.
+ */
+ if (NvwalAvail)
+ {
+ WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+ flexible = false; /* ensure it all gets written */
+ }
+
/*
* If already known flushed, we're done. Just need to check if we are
* holding an open file handle to a logfile that's no longer in use,
@@ -3047,7 +3300,12 @@ XLogBackgroundFlush(void)
flushbytes =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
- if (WalWriterFlushAfter == 0 || lastflush == 0)
+ if (NvwalAvail)
+ {
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else if (WalWriterFlushAfter == 0 || lastflush == 0)
{
/* first call, or block based limits disabled */
WriteRqst.Flush = WriteRqst.Write;
@@ -3106,7 +3364,28 @@ XLogBackgroundFlush(void)
* Great, done. To take some work off the critical path, try to initialize
* as many of the no-longer-needed WAL buffers for future use as we can.
*/
- AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+ if (NvwalAvail && max_wal_senders == 0)
+ {
+ XLogRecPtr upto;
+
+ /*
+ * If NVWAL is used and there is no walsender, nobody will load
+ * segments from the buffer, so let's recycle segments up to {where we
+ * have requested to write and flush} + NvwalSize.
+ *
+ * Note that if NVWAL is used and a walsender seems to be running, we
+ * must do nothing; keep the written pages in the buffer so walsenders
+ * load them from the buffer, not from the segment files. The buffer
+ * pages will eventually be recycled by checkpoint.
+ */
+ Assert(WriteRqst.Write == WriteRqst.Flush);
+ Assert(WriteRqst.Write % wal_segment_size == 0);
+
+ upto = WriteRqst.Write + NvwalSize;
+ AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+ }
+ else
+ AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
/*
* If we determined that we need to write data, but somebody else
@@ -3806,6 +4085,43 @@ XLogFileClose(void)
openLogFile = -1;
}
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+ XLogRecPtr newupto,
+ InitializedUpTo;
+
+ Assert(NvwalAvail);
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ newupto = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ InitializedUpTo = XLogCtl->InitializedUpTo;
+
+ newupto += NvwalSize;
+ Assert(newupto % wal_segment_size == 0);
+
+ if (newupto <= InitializedUpTo)
+ return;
+
+ /*
+ * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+ * handles the first argument as the beginning of pages, not the end.
+ */
+ AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
/*
* Preallocate log files beyond the specified log endpoint.
*
@@ -4101,8 +4417,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
* Before deleting the file, see if it can be recycled as a future log
* segment. Only recycle normal files, pg_standby for example can create
* symbolic links pointing to a separate archive directory.
+ *
+ * If the NVWAL buffer is used, a log segment file is never recycled
+ * (that is, we always take the else branch below).
*/
- if (wal_recycle &&
+ if (!NvwalAvail && wal_recycle &&
endlogSegNo <= recycleSegNo &&
lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
@@ -5336,36 +5655,53 @@ BootStrapXLOG(void)
record->xl_crc = crc;
/* Create first XLOG segment file */
- use_existent = false;
- openLogFile = XLogFileInit(1, &use_existent, false);
-
- /* Write the first page with the initial record */
- errno = 0;
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
- if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+ if (NvwalAvail)
{
- /* if write didn't set errno, assume problem is no disk space */
- if (errno == 0)
- errno = ENOSPC;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not write bootstrap write-ahead log file: %m")));
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+ pgstat_report_wait_end();
+
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ nv_drain();
+ pgstat_report_wait_end();
+
+ /*
+ * Other WAL state will be initialized by the startup process.
+ */
}
- pgstat_report_wait_end();
+ else
+ {
+ use_existent = false;
+ openLogFile = XLogFileInit(1, &use_existent, false);
+
+ /* Write the first page with the initial record */
+ errno = 0;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write bootstrap write-ahead log file: %m")));
+ }
+ pgstat_report_wait_end();
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
- if (pg_fsync(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync bootstrap write-ahead log file: %m")));
- pgstat_report_wait_end();
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ if (pg_fsync(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync bootstrap write-ahead log file: %m")));
+ pgstat_report_wait_end();
- if (close(openLogFile))
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not close bootstrap write-ahead log file: %m")));
+ if (close(openLogFile))
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not close bootstrap write-ahead log file: %m")));
- openLogFile = -1;
+ openLogFile = -1;
+ }
/* Now create pg_control */
@@ -5378,6 +5714,7 @@ BootStrapXLOG(void)
ControlFile->checkPoint = checkPoint.redo;
ControlFile->checkPointCopy = checkPoint;
ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+ ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
/* Set important parameter values for use when replaying WAL */
ControlFile->MaxConnections = MaxConnections;
@@ -5638,35 +5975,41 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
* happens in the middle of a segment, copy data from the last WAL segment
* of the old timeline up to the switch point, to the starting WAL segment
* on the new timeline.
+ *
+ * If the non-volatile WAL buffer is used, no new segment file is created.
+ * Data up to the switch point will be copied into the NVWAL buffer by
+ * StartupXLOG().
*/
- if (endLogSegNo == startLogSegNo)
+ if (!NvwalAvail)
{
- /*
- * Make a copy of the file on the new timeline.
- *
- * Writing WAL isn't allowed yet, so there are no locking
- * considerations. But we should be just as tense as XLogFileInit to
- * avoid emplacing a bogus file.
- */
- XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
- XLogSegmentOffset(endOfLog, wal_segment_size));
- }
- else
- {
- /*
- * The switch happened at a segment boundary, so just create the next
- * segment on the new timeline.
- */
- bool use_existent = true;
- int fd;
+ if (endLogSegNo == startLogSegNo)
+ {
+ /*
+ * Make a copy of the file on the new timeline.
+ *
+ * Writing WAL isn't allowed yet, so there are no locking
+ * considerations. But we should be just as tense as XLogFileInit to
+ * avoid emplacing a bogus file.
+ */
+ XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+ XLogSegmentOffset(endOfLog, wal_segment_size));
+ }
+ else
+ {
+ /*
+ * The switch happened at a segment boundary, so just create the next
+ * segment on the new timeline.
+ */
+ bool use_existent = true;
+ int fd;
- fd = XLogFileInit(startLogSegNo, &use_existent, true);
+ fd = XLogFileInit(startLogSegNo, &use_existent, true);
- if (close(fd))
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not close file \"%s\": %m",
- XLogFileNameP(ThisTimeLineID, startLogSegNo))));
+ if (close(fd))
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m",
+ XLogFileNameP(ThisTimeLineID, startLogSegNo))));
+ }
}
/*
@@ -6888,6 +7231,11 @@ StartupXLOG(void)
InRecovery = true;
}
+ /* Dump discardedUpTo just before REDO */
+ elog(LOG, "ControlFile->discardedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
+
/* REDO */
if (InRecovery)
{
@@ -7635,10 +7983,88 @@ StartupXLOG(void)
Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+ if (NvwalAvail)
+ {
+ XLogRecPtr discardedUpTo;
+
+ discardedUpTo = ControlFile->discardedUpTo;
+ Assert(discardedUpTo == InvalidXLogRecPtr ||
+ discardedUpTo % wal_segment_size == 0);
+
+ if (discardedUpTo == InvalidXLogRecPtr)
+ {
+ elog(DEBUG1, "brand-new NVWAL");
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else if (EndOfLog <= discardedUpTo)
+ {
+ elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = InvalidXLogRecPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else
+ {
+ int last_idx;
+ int idx;
+ XLogRecPtr ptr;
+
+ elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+ /*
+ * Initialize the xlblocks array because we decided to keep UNDONE
+ * records in the NVWAL buffer; otherwise, each buffer page with
+ * xlblocks == 0 (initialized so by XLOGShmemInit) would be
+ * accidentally cleared by the following AdvanceXLInsertBuffer!
+ *
+ * Two cases can be considered:
+ *
+ * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+ * Initialize up to (and including) the page containing the last
+ * record. That page should end at EndOfLog. The next page "N",
+ * beginning at EndOfLog, is left untouched because, in the very
+ * corner case where all the NVWAL buffer pages are already
+ * filled, page N occupies the same location as the first page
+ * "F" beginning at discardedUpTo. Of course we should not
+ * overwrite page F.
+ *
+ * In this case, we first take XLogRecPtrToBufIdx(EndOfLog) as
+ * last_idx, indicating page N. Then we go forward from page F
+ * up to (but excluding) page N, which has the same buffer index
+ * as page F.
+ *
+ * 2) EndOfLog is not on a page boundary: Initialize all the pages
+ * except the page "L" containing the last record. Page L,
+ * including its content, is initialized by the following "Tricky
+ * point" block.
+ *
+ * In either case, XLogCtl->InitializedUpTo is set in the following
+ * "Tricky" if-else block.
+ */
+
+ last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+ ptr = discardedUpTo;
+ for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+ idx = NextBufIdx(idx))
+ {
+ ptr += XLOG_BLCKSZ;
+ XLogCtl->xlblocks[idx] = ptr;
+ }
+ }
+ }
+
/*
- * Tricky point here: readBuf contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
+ * Tricky point here: readBuf contains the *last* block that the
+ * LastRec record spans, not the one it starts in. The last block is
+ * indeed the one we want to use.
*/
if (EndOfLog % XLOG_BLCKSZ != 0)
{
@@ -7658,6 +8084,9 @@ StartupXLOG(void)
memcpy(page, xlogreader->readBuf, len);
memset(page + len, 0, XLOG_BLCKSZ - len);
+ if (NvwalAvail)
+ nv_persist(page, XLOG_BLCKSZ);
+
XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
}
@@ -7671,12 +8100,54 @@ StartupXLOG(void)
XLogCtl->InitializedUpTo = EndOfLog;
}
- LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+ if (NvwalAvail)
+ {
+ XLogRecPtr SegBeginPtr;
- XLogCtl->LogwrtResult = LogwrtResult;
+ /*
+ * If the NVWAL buffer is used, writing records out to segment files
+ * should be done segment by segment, so Logwrt{Rqst,Result} (and also
+ * discardedUpTo) should be multiples of wal_segment_size. Let's back
+ * them off to the last segment boundary.
+ */
- XLogCtl->LogwrtRqst.Write = EndOfLog;
- XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+ /*
+ * persistentUpTo need not be a multiple of wal_segment_size; it
+ * should be the drained-up-to LSN. walsender will use it to load
+ * records from the NVWAL buffer.
+ */
+ XLogCtl->persistentUpTo = EndOfLog;
+
+ /* Update discardedUpTo in pg_control if still invalid */
+ if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+ {
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
+ elog(DEBUG1, "EndOfLog: %X/%X",
+ (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
+
+ elog(DEBUG1, "SegBeginPtr: %X/%X",
+ (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+ }
+ else
+ {
+ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ XLogCtl->LogwrtRqst.Write = EndOfLog;
+ XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ }
/*
* Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7807,6 +8278,7 @@ StartupXLOG(void)
char origpath[MAXPGPATH];
char partialfname[MAXFNAMELEN];
char partialpath[MAXPGPATH];
+ XLogRecPtr discardedUpTo;
XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7818,6 +8290,53 @@ StartupXLOG(void)
*/
XLogArchiveCleanup(partialfname);
+ /*
+ * If NVWAL is also used for archival recovery, write old
+ * records out to segment files so they can be archived. Note
+ * that we need the WAL-related locks because
+ * LocalXLogInsertAllowed has already gone to -1.
+ */
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo < EndOfLog)
+ {
+ XLogwrtRqst WriteRqst;
+ TimeLineID thisTLI = ThisTimeLineID;
+ XLogRecPtr SegBeginPtr =
+ EndOfLog - (EndOfLog % wal_segment_size);
+
+ /*
+ * XXX Assume that all the records have the same TLI.
+ */
+ ThisTimeLineID = EndOfLogTLI;
+
+ WriteRqst.Write = EndOfLog;
+ WriteRqst.Flush = 0;
+
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ XLogWrite(WriteRqst, false);
+
+ /*
+ * Force back-off to the last segment boundary.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ LWLockRelease(WALWriteLock);
+
+ ThisTimeLineID = thisTLI;
+ }
+
durable_rename(origpath, partialpath, ERROR);
XLogArchiveNotify(partialfname);
}
@@ -7827,7 +8346,10 @@ StartupXLOG(void)
/*
* Preallocate additional log files, if wanted.
*/
- PreallocXlogFiles(EndOfLog);
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(EndOfLog);
/*
* Okay, we're officially UP.
@@ -8371,10 +8893,24 @@ GetInsertRecPtr(void)
/*
* GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
* position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
*/
XLogRecPtr
GetFlushRecPtr(void)
{
+ if (NvwalAvail)
+ {
+ XLogRecPtr ret;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ LogwrtResult = XLogCtl->LogwrtResult;
+ ret = XLogCtl->persistentUpTo;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ return ret;
+ }
+
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
@@ -8674,6 +9210,9 @@ CreateCheckPoint(int flags)
VirtualTransactionId *vxids;
int nvxids;
+ /* for non-volatile WAL buffer */
+ XLogRecPtr newDiscardedUpTo = 0;
+
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
@@ -8985,6 +9524,22 @@ CreateCheckPoint(int flags)
*/
PriorRedoPtr = ControlFile->checkPointCopy.redo;
+ /*
+ * If the non-volatile WAL buffer is used, discardedUpTo should be updated
+ * and persisted in the control file, so the new value is calculated
+ * here.
+ *
+ * TODO: Do not copy-and-paste this code...
+ */
+ if (NvwalAvail)
+ {
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ _logSegNo--;
+
+ newDiscardedUpTo = _logSegNo * wal_segment_size;
+ }
+
/*
* Update the control file.
*/
@@ -8993,6 +9548,16 @@ CreateCheckPoint(int flags)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
+ if (NvwalAvail)
+ {
+ /*
+ * A new value should not fall behind the old one.
+ */
+ if (ControlFile->discardedUpTo < newDiscardedUpTo)
+ ControlFile->discardedUpTo = newDiscardedUpTo;
+ else
+ newDiscardedUpTo = ControlFile->discardedUpTo;
+ }
ControlFile->time = (pg_time_t) time(NULL);
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9010,6 +9575,44 @@ CreateCheckPoint(int flags)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+ * so that the XLOG records older than newDiscardedUpTo are treated as
+ * "already written and flushed."
+ */
+ if (NvwalAvail)
+ {
+ Assert(newDiscardedUpTo > 0);
+
+ /* Update process-local variables */
+ LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+ /*
+ * Update shared-memory variables. We need both light-weight lock and
+ * spin lock to update them.
+ */
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&XLogCtl->info_lck);
+
+ /*
+ * Note that there is a corner case in which the process-local
+ * LogwrtResult falls behind the shared XLogCtl->LogwrtResult: the
+ * whole non-volatile XLOG buffer is filled and some pages are written
+ * out to segment files between UpdateControlFile and LWLockAcquire
+ * above.
+ *
+ * TODO: For now, we ignore that case because it should rarely occur.
+ */
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+ if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+ SpinLockRelease(&XLogCtl->info_lck);
+ LWLockRelease(WALWriteLock);
+ }
+
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptFullXid = checkPoint.nextFullXid;
@@ -9033,21 +9636,31 @@ CreateCheckPoint(int flags)
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
- /*
- * Delete old log files, those no longer needed for last checkpoint to
- * prevent the disk holding the xlog from growing full.
- */
- XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
- KeepLogSeg(recptr, &_logSegNo);
- _logSegNo--;
- RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ if (NvwalAvail)
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ else
+ {
+ /*
+ * Delete old log files, those no longer needed for last checkpoint to
+ * prevent the disk holding the xlog from growing full.
+ */
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ }
/*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
- PreallocXlogFiles(recptr);
+ {
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(recptr);
+ }
/*
* Truncate pg_subtrans if possible. We can throw away all data before
@@ -11651,6 +12264,76 @@ CancelBackup(void)
}
}
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+ return NvwalAvail;
+}
+
+/*
+ * Get a pointer to the *possibly* right location in the NVWAL buffer
+ * containing the target XLogRecPtr; NULL if the target has already been
+ * discarded.
+ *
+ * Note that the target could be discarded by a checkpoint after this
+ * function returns. The caller should check that the copied record has
+ * the expected LSN.
+ */
+char *
+GetNvwalBuffer(XLogRecPtr target, Size *max_read)
+{
+ Size off;
+ XLogRecPtr discardedUpTo;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ if (target < discardedUpTo)
+ return NULL;
+
+ off = target % NvwalSize;
+ *max_read = NvwalSize - off;
+ return XLogCtl->pages + off;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets *nvwalptr to the LSN
+ * to load from.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+ XLogRecPtr readUpTo;
+ XLogRecPtr discardedUpTo;
+
+ Assert(IsNvwalAvail());
+ Assert(nvwalptr != NULL);
+
+ readUpTo = target + count;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check if all the records are on WAL segment files */
+ if (readUpTo <= discardedUpTo)
+ return 0;
+
+ /* Check if all the records are on NVWAL */
+ if (discardedUpTo <= target)
+ {
+ *nvwalptr = target;
+ return count;
+ }
+
+ /* Some on WAL segment files, some on NVWAL */
+ *nvwalptr = discardedUpTo;
+ return (Size) (readUpTo - discardedUpTo);
+}
+
/*
* Read the XLOG page containing RecPtr into readBuf (if not read already).
* Returns number of bytes read, if the page is read successfully, or -1
@@ -11718,7 +12401,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
retry:
/* See if we need to retrieve more data */
- if (readFile < 0 ||
+ if ((readSource != XLOG_FROM_NVWAL && readFile < 0) ||
(readSource == XLOG_FROM_STREAM &&
receivedUpto < targetPagePtr + reqLen))
{
@@ -11730,10 +12413,68 @@ retry:
if (readFile >= 0)
close(readFile);
readFile = -1;
- readLen = 0;
- readSource = 0;
- return -1;
+ /*
+ * Try the non-volatile WAL buffer as a last resort.
+ *
+ * XXX It is not yet supported in standby mode.
+ */
+ if (NvwalAvail && !StandbyMode && readSource != XLOG_FROM_STREAM)
+ {
+ XLogRecPtr discardedUpTo;
+
+ elog(DEBUG1, "see if NVWAL has records to be UNDONE");
+
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo <= targetPagePtr)
+ {
+ elog(DEBUG1, "recovering NVWAL");
+
+ /* Loading records from non-volatile WAL buffer */
+ currentSource = XLOG_FROM_NVWAL;
+ lastSourceFailed = false;
+
+ /* Report recovery progress in PS display */
+ set_ps_display("recovering NVWAL", false);
+
+ /* Track source of data */
+ readSource = XLOG_FROM_NVWAL;
+ XLogReceiptSource = XLOG_FROM_NVWAL;
+
+ /* Track receipt time */
+ XLogReceiptTime = GetCurrentTimestamp();
+
+ /*
+ * Construct expectedTLEs. This is necessary when recovering
+ * only from NVWAL because, unlike a segment file, it carries
+ * no TLI information in its filename.
+ */
+ if (!expectedTLEs)
+ {
+ TimeLineHistoryEntry *entry;
+
+ entry = (TimeLineHistoryEntry *) palloc(sizeof(TimeLineHistoryEntry));
+ entry->tli = recoveryTargetTLI;
+ entry->begin = entry->end = InvalidXLogRecPtr;
+
+ expectedTLEs = list_make1(entry);
+
+ elog(DEBUG1, "expectedTLEs: [%u]", (uint32) recoveryTargetTLI);
+ }
+ }
+ }
+ else
+ elog(DEBUG1, "do not recover NVWAL");
+
+ /* See if the try above succeeded or not */
+ if (readSource != XLOG_FROM_NVWAL)
+ {
+ readLen = 0;
+ readSource = 0;
+
+ return -1;
+ }
}
}
@@ -11741,7 +12482,7 @@ retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
- Assert(readFile != -1);
+ Assert(readFile != -1 || readSource == XLOG_FROM_NVWAL);
/*
* If the current segment is being streamed from master, calculate how
@@ -11760,41 +12501,60 @@ retry:
else
readLen = XLOG_BLCKSZ;
- /* Read the requested page */
readOff = targetPageOff;
- pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
- r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
- if (r != XLOG_BLCKSZ)
+ if (currentSource == XLOG_FROM_NVWAL)
{
- char fname[MAXFNAMELEN];
- int save_errno = errno;
+ Size offset = (Size) (targetPagePtr % NvwalSize);
+ char *readpos = XLogCtl->pages + offset;
+ Assert(readLen == XLOG_BLCKSZ);
+ Assert(offset % XLOG_BLCKSZ == 0);
+
+ /* Load the requested page from non-volatile WAL buffer */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ memcpy(readBuf, readpos, readLen);
pgstat_report_wait_end();
- XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
- if (r < 0)
+
+ /* There are not any other clues of TLI... */
+ *readTLI = ((XLogPageHeader) readBuf)->xlp_tli;
+ }
+ else
+ {
+ /* Read the requested page from file */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+ if (r != XLOG_BLCKSZ)
{
- errno = save_errno;
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
- fname, readOff)));
+ char fname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ pgstat_report_wait_end();
+ XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+ if (r < 0)
+ {
+ errno = save_errno;
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode_for_file_access(),
+ errmsg("could not read from log segment %s, offset %u: %m",
+ fname, readOff)));
+ }
+ else
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+ fname, readOff, r, (Size) XLOG_BLCKSZ)));
+ goto next_record_is_invalid;
}
- else
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("could not read from log segment %s, offset %u: read %d of %zu",
- fname, readOff, r, (Size) XLOG_BLCKSZ)));
- goto next_record_is_invalid;
+ pgstat_report_wait_end();
+
+ *readTLI = curFileTLI;
}
- pgstat_report_wait_end();
Assert(targetSegNo == readSegNo);
Assert(targetPageOff == readOff);
Assert(reqLen <= readLen);
- *readTLI = curFileTLI;
-
/*
* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -11828,6 +12588,17 @@ retry:
goto next_record_is_invalid;
}
+ /*
+ * Update curFileTLI on each verified page if the non-volatile WAL buffer
+ * is used, because there is no TimeLineID information in NVWAL's filename.
+ */
+ if (readSource == XLOG_FROM_NVWAL &&
+ curFileTLI != xlogreader->latestPageTLI)
+ {
+ curFileTLI = xlogreader->latestPageTLI;
+ elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+ }
+
return readLen;
next_record_is_invalid:
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92fa86fc9d..f2992b4a85 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2368,7 +2368,9 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count)
{
char *p;
XLogRecPtr recptr;
+ XLogRecPtr recptr_nvwal = 0;
Size nbytes;
+ Size nbytes_nvwal = 0;
XLogSegNo segno;
retry:
@@ -2376,6 +2378,13 @@ retry:
recptr = startptr;
nbytes = count;
+ /* Try to load records directly from NVWAL if used */
+ if (IsNvwalAvail())
+ {
+ nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+ nbytes = count - nbytes_nvwal;
+ }
+
while (nbytes > 0)
{
uint32 startoff;
@@ -2500,6 +2509,47 @@ retry:
p += readbytes;
}
+ /* Load records directly from NVWAL */
+ while (nbytes_nvwal > 0)
+ {
+ char *src;
+ Size max_read = 0;
+ Size readbytes;
+
+ Assert(IsNvwalAvail());
+
+ /*
+ * Get the target address on NVWAL and the size we can load from it at
+ * once because WAL buffer can rotate and we might have to load what we
+ * want divided into two or more parts.
+ *
+ * Note that, in a rare case, some records on NVWAL might have been
+ * already discarded. We retry in such a case.
+ */
+ src = GetNvwalBuffer(recptr_nvwal, &max_read);
+ if (src == NULL)
+ {
+ elog(WARNING, "some records on NVWAL had been discarded; retry");
+ goto retry;
+ }
+
+ if (nbytes_nvwal < max_read)
+ readbytes = nbytes_nvwal;
+ else
+ readbytes = max_read;
+
+ memcpy(p, src, readbytes);
+
+ /*
+ * Update state for load. Note that we do not need to update sendOff
+ * because it indicates an offset in a segment file and we do not use
+ * any segment file inside this loop.
+ */
+ recptr_nvwal += readbytes;
+ nbytes_nvwal -= readbytes;
+ p += readbytes;
+ }
+
/*
* After reading into the buffer, check that what we read was valid. We do
* this after reading, because even though the segment was present when we
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d955b97c0b..a47caefa99 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -280,6 +280,9 @@ main(int argc, char *argv[])
ControlFile->checkPointCopy.oldestCommitTsXid);
printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
ControlFile->checkPointCopy.newestCommitTsXid);
+ printf(_("discarded Up To: %X/%X\n"),
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
printf(_("Time of latest checkpoint: %s\n"),
ckpttime_str);
printf(_("Fake LSN counter for unlogged rels: %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index bc09fa104c..14efb904be 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -325,6 +325,12 @@ extern void XLogRequestWalReceiverReply(void);
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
+extern bool IsNvwalAvail(void);
+extern char *GetNvwalBuffer(XLogRecPtr target, Size *max_read);
+extern Size GetLoadableSizeFromNvwal(XLogRecPtr target,
+ Size count,
+ XLogRecPtr *nvwalptr);
+
/*
* Routines to start, stop, and get status of a base backup.
*/
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ff98d9e91a..04b1b94645 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
/* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION 1201
+#define PG_CONTROL_VERSION 9901
/* Nonce key length, see below */
#define MOCK_AUTH_NONCE_LEN 32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels */
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+ * checkpoint or a restartpoint is completed successfully, or the whole
+ * NVWAL is filled with WAL records and a new record is being inserted.
+ * This field tells that the NVWAL contains WAL records in the range of
+ * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+ * Note that WAL records whose LSNs are less than discardedUpTo remain
+ * in WAL segment files and may be needed for recovery.
+ *
+ * It is set to zero when NVWAL is not used.
+ */
+ XLogRecPtr discardedUpTo;
+
/*
* These two values determine the minimum point we must recover up to
* before starting up:
--
2.20.1
Attachment: 0003-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From e98b3c3fd4c48b21b4fe26d568899722cd202dc9 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Fri, 24 Jan 2020 13:16:28 +0900
Subject: [PATCH 3/3] README for non-volatile WAL buffer
---
README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)
create mode 100644 README.nvwal
diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By putting the WAL buffer pages on persistent memory (PMEM)
+[1], inserting WAL records into them directly, and eliminating I/O for WAL
+segment files, PostgreSQL achieves lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+ * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+ * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+ * Linux: 4.15 or later (tested on 5.2)
+ * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+ * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I recommend installing under your home directory with the --prefix option.
+If you do so, please DO NOT forget to "export PATH".
+
+ $ ./configure --with-nvwal --prefix="$HOME/postgres"
+ $ make
+ $ make install
+ $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget "-o dax" option
+on mount. For Intel DCPMM and ipmctl, please see [4].
+
+ $ ndctl list
+ [
+ {
+ "dev":"namespace1.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem1",
+ "numa_node":1
+ },
+ {
+ "dev":"namespace0.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+ ]
+
+ $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+ {
+ "dev":"namespace0.0",
+ "mode":"fsdax",
+ "map":"dev",
+ "size":"94.50 GiB (101.47 GB)",
+ "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+
+ $ ls -l /dev/pmem0
+ brw-rw---- 1 root disk 259, 3 Jan 6 17:06 /dev/pmem0
+
+ $ sudo mkfs.ext4 -q -F /dev/pmem0
+ $ sudo mkdir -p /mnt/pmem0
+ $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+ $ mount -l | grep ^/dev/pmem0
+ /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge page
+----------------------------
+Transparent huge pages are generally not suitable for database workloads,
+but they improve PMEM performance by reducing page-walk overhead.
+
+ $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+ -rw-r--r-- 1 root root 4096 Dec 3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+ $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+ $ cat /sys/kernel/mm/transparent_hugepage/enabled
+ [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+ -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)
+ -Q, --nvwal-size=SIZE size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+ $ sudo mkdir -p /mnt/pmem0/pgsql
+ $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+ $ export PGDATA="$HOME/pgdata"
+ $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find there is no WAL segment file to be created in PGDATA/pg_wal
+directory. That is okay; your NVWAL file has the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+ segment size. The segment size is given with initdb --wal-segsize and
+ defaults to 16MB.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+ which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+ above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+ exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+ not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please be careful to decide
+ how large your NVWAL file is to be.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters nvwal_path and nvwal_size, corresponding to the two
+new initdb options. If you run initdb as above, you will find that
+postgresql.conf in your PGDATA directory contains entries like the following:
+
+ max_wal_size = 80GB
+ min_wal_size = 80GB
+ nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+ nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+ actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+ forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+ nvwal_size. postgres might run even if the three values differ; however,
+ we have not tested such a case yet.
+
+
+Startup
+-------
+Same as you know:
+
+ $ pg_ctl start
+
+or use numactl as follows to let postgres run on the specified NUMA node
+(typically the one on which your NVWAL file is) if you need stable performance:
+
+ $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
--
2.20.1
Hello,
+1 on the idea.
By quickly looking at the patch, I notice that there are no tests.
Is it possible to emulate something without the actual hardware, at least
for testing purposes?
--
Fabien.
On 24/01/2020 10:06, Takashi Menjo wrote:
I propose "non-volatile WAL buffer," a proof-of-concept new feature. It
enables WAL records to be durable without output to WAL segment files by
residing on persistent memory (PMEM) instead of DRAM. It improves database
performance by reducing copies of WAL and shortening the time of write
transactions.
I attach the first patchset that can be applied to PostgreSQL 12.0 (refs/
tags/REL_12_0). Please see README.nvwal (added by the patch 0003) to use
the new feature.
I have the same comments on this that I had on the previous patch, see:
/messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi
- Heikki
Hello Fabien,
Thank you for your +1 :)
Is it possible to emulate something without the actual hardware, at least
for testing purposes?
Yes, you can emulate PMEM using DRAM on Linux, via the "memmap=nnG!ssG" kernel
parameter. Please see [1] and [2] for emulation details. If your emulation
does not work well, please check whether the kernel configuration options (like
CONFIG_FOOBAR) for PMEM and DAX (see [1] and [3]) are set up properly.
Best regards,
Takashi
[1]: How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
     https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
[2]: how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
     https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
[3]: Persistent Memory Wiki
     https://nvdimm.wiki.kernel.org/
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
Hello Heikki,
I have the same comments on this that I had on the previous patch, see:
Thanks. I re-read your messages [1][2]. What you meant, AFAIU, is how
about using memory-mapped WAL segment files as WAL buffers, and switching
CPU instructions or msync() depending on whether the segment files are on
PMEM or not, to sync inserted WAL records.
It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.
You also mentioned a SIGBUS problem with memory-mapped I/O. I think it's true
for reading from bad memory blocks, as you said, and also true for writing to
such blocks [3]. Handling SIGBUS properly or working around it is future
work.
Best regards,
Takashi
[1]: /messages/by-id/83eafbfd-d9c5-6623-2423-7cab1be3888c@iki.fi
[2]: /messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi
[3]: https://pmem.io/2018/11/26/bad-blocks.htm
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:
It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.
I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than read(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.
I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.
A related, but more minor, concern is whether there are any
differences in in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hello Robert,
I think our concerns are roughly classified into two:
(1) Performance
(2) Consistency
And your "different concern" is rather into (2), I think.
I'm also worried about it, but I have no good answer for now. I suppose mmap(flags|=MAP_SHARED) called by multiple backend processes for the same file works consistently for both PMEM and non-PMEM devices. However, I have not found any evidence such as specification documents yet.
I also made a tiny program calling memcpy() and msync() on the same mmap()-ed file but mutually distinct address range in parallel, and found that there was no corrupted data. However, that result does not ensure any consistency I'm worried about. I could give it up if there *were* corrupted data...
So I will go to (1) first. I will test the way Heikki told us to answer whether the cost of mmap() and munmap() per WAL segment, etc, is reasonable or not. If it really is, then I will go to (2).
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:
I think our concerns are roughly classified into two:
(1) Performance
(2) Consistency
And your "different concern" is rather into (2), I think.
Actually, I think it was mostly a performance concern (writes
triggering lots of reading) but there might be a consistency issue as
well.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2020-01-27 13:54:38 -0500, Robert Haas wrote:
On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:
It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.
I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than read(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.
mmap()/munmap() on a regular basis does have pretty bad scalability
impacts. I don't think they'd fully hit us, however, because we're not in a
threaded world.
My issue with the proposal to go towards mmap()/munmap() is that I think
doing so forecloses a lot of improvements. Even today, on fast storage,
using open_datasync is faster (at least when somehow hitting the
O_DIRECT path, which isn't that easy these days) - and that's despite it
being really unoptimized. I think our WAL scalability is a serious
issue. There's a fair bit that we can improve by just fixing things without
really changing the way we do IO:
- Split WALWriteLock into one lock for writing and one for flushing the
WAL. Right now we prevent other sessions from writing out WAL - even
to other segments - when one session is doing a WAL flush. But there's
absolutely no need for that.
- Stop increasing the size of the flush request to the max when flushing
WAL (cf "try to write/flush later additions to XLOG as well" in
XLogFlush()) - that currently reduces throughput in OLTP workloads
quite noticeably. It made some sense in the spinning-disk era, but I
don't think it does for a halfway decent SSD. By writing the maximum
ready to write, we hold the lock for longer, increasing latency for
the committing transaction *and* preventing more WAL from being written.
- We should immediately ask the OS to flush writes for full XLOG pages
back to the OS. Right now the IO for that will never be started before
the commit comes around in an OLTP workload, which means that we just
waste the time between the XLogWrite() and the commit.
That'll gain us 2-3x, I think. But after that I think we're going to
have to actually change more fundamentally how we do IO for WAL
writes. Using async IO I can do like 18k individual durable 8kb writes
(using O_DSYNC) a second, at a queue depth of 32. On my laptop. If I
make it 4k writes, it's 22k.
That's not directly comparable with postgres WAL flushes, of course, as
it's all separate blocks, whereas WAL will often end up overwriting the
last block. But it doesn't at all account for group commits either,
which we *constantly* end up doing.
Postgres manages somewhere between ~450 (multiple users) ~800 (single
user) individually durable WAL writes / sec on the same hardware. Yes,
that's more than an order of magnitude less. Of course some of that is
just that postgres does more than just IO - but that's not an effect on the
order of a magnitude.
So, why am I bringing this up in this thread? Only because I do not see
a way to actually utilize non-pmem hardware to a much higher degree than
we are doing now by using mmap(). Doing so requires using direct IO,
which is fundamentally incompatible with using mmap().
I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.
Yea, that's a serious concern.
A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.
I don't think there's a significant difference in case of linux - no
idea about others. And either way we probably should force the kernel's
hand to start flushing much sooner.
Greetings,
Andres Freund
Dear hackers,
I made another WIP patchset to mmap WAL segments as WAL buffers. Note that this is not a non-volatile WAL buffer patchset but its competitor. I am measuring and analyzing the performance of this patchset to compare with my non-volatile WAL buffer.
Please wait several more days for the result report...
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Wednesday, January 29, 2020 6:00 AM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Heikki Linnakangas <hlinnaka@iki.fi>; pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer
On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
I think our concerns are roughly classified into two:
(1) Performance
(2) Consistency
And your "different concern" is rather into (2), I think.
Actually, I think it was mostly a performance concern (writes triggering lots of reading) but there might be a
consistency issue as well.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
0001-Preallocate-more-WAL-segments.patch (application/octet-stream)
From 72728138ef92b744b64464d21ba35d4b717a55bb Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:11 +0900
Subject: [msync 1/5] Preallocate more WAL segments
Please run ./configure with LIBS=-lpmem to build this patchset.
---
src/backend/access/transam/xlog.c | 27 ++++++++++-----------------
1 file changed, 10 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 77ad765989..e2cd34057f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -891,7 +891,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
-static void PreallocXlogFiles(XLogRecPtr endptr);
+static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
static void RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
@@ -3801,27 +3801,20 @@ XLogFileClose(void)
/*
* Preallocate log files beyond the specified log endpoint.
- *
- * XXX this is currently extremely conservative, since it forces only one
- * future log segment to exist, and even that only if we are 75% done with
- * the current one. This is only appropriate for very low-WAL-volume systems.
- * High-volume systems will be OK once they've built up a sufficient set of
- * recycled log segments, but the startup transient is likely to include
- * a lot of segment creations by foreground processes, which is not so good.
*/
static void
-PreallocXlogFiles(XLogRecPtr endptr)
+PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
{
XLogSegNo _logSegNo;
+ XLogSegNo endSegNo;
+ XLogSegNo recycleSegNo;
int lf;
bool use_existent;
- uint64 offset;
- XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
- offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
- if (offset >= (uint32) (0.75 * wal_segment_size))
+ XLByteToPrevSeg(endptr, endSegNo, wal_segment_size);
+ recycleSegNo = XLOGfileslop(RedoRecPtr);
+ for (_logSegNo = endSegNo + 1; _logSegNo <= recycleSegNo; _logSegNo++)
{
- _logSegNo++;
use_existent = true;
lf = XLogFileInit(_logSegNo, &use_existent, true);
close(lf);
@@ -7692,7 +7685,7 @@ StartupXLOG(void)
/*
* Preallocate additional log files, if wanted.
*/
- PreallocXlogFiles(EndOfLog);
+ PreallocXlogFiles(RedoRecPtr, EndOfLog);
/*
* Okay, we're officially UP.
@@ -8905,7 +8898,7 @@ CreateCheckPoint(int flags)
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
- PreallocXlogFiles(recptr);
+ PreallocXlogFiles(RedoRecPtr, recptr);
/*
* Truncate pg_subtrans if possible. We can throw away all data before
@@ -9255,7 +9248,7 @@ CreateRestartPoint(int flags)
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
- PreallocXlogFiles(endptr);
+ PreallocXlogFiles(RedoRecPtr, endptr);
/*
* ThisTimeLineID is normally not set when we're still in recovery.
--
2.20.1
0002-Use-WAL-segments-as-WAL-buffers.patch (application/octet-stream)
From 76cda0eb7b660654b0aa379883ebe4952658cdf7 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:12 +0900
Subject: [msync 2/5] Use WAL segments as WAL buffers
Note that we ignore wal_sync_method from here on.
---
src/backend/access/transam/xlog.c | 833 +++++++++---------------------
1 file changed, 243 insertions(+), 590 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e2cd34057f..43f9a8affc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -15,6 +15,7 @@
#include "postgres.h"
#include <ctype.h>
+#include <libpmem.h>
#include <math.h>
#include <time.h>
#include <fcntl.h>
@@ -613,24 +614,8 @@ typedef struct XLogCtlData
XLogwrtResult LogwrtResult;
/*
- * Latest initialized page in the cache (last byte position + 1).
- *
- * To change the identity of a buffer (and InitializedUpTo), you need to
- * hold WALBufMappingLock. To change the identity of a buffer that's
- * still dirty, the old page needs to be written out first, and for that
- * you need WALWriteLock, and you need to ensure that there are no
- * in-progress insertions to the page by calling
- * WaitXLogInsertionsToFinish().
+ * This value does not change after startup.
*/
- XLogRecPtr InitializedUpTo;
-
- /*
- * These values do not change after startup, although the pointed-to pages
- * and xlblocks values certainly do. xlblock values are protected by
- * WALBufMappingLock.
- */
- char *pages; /* buffers for unwritten XLOG pages */
- XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
int XLogCacheBlck; /* highest allocated xlog buffer index */
/*
@@ -775,9 +760,16 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
* openLogFile is -1 or a kernel FD for an open log file segment.
* openLogSegNo identifies the segment. These variables are only used to
* write the XLOG, and so will normally refer to the active segment.
+ *
+ * mappedPages is the mmap(2)-ed address of the open log file segment.
+ * It is used as WAL buffer instead of XLogCtl->pages.
+ *
+ * pmemMapped is true if mappedPages is on PMEM.
*/
static int openLogFile = -1;
static XLogSegNo openLogSegNo = 0;
+static char *mappedPages = NULL;
+static bool pmemMapped = false;
/*
* These variables are used similarly to the ones above, but for reading
@@ -875,12 +867,12 @@ static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
-static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
static bool XLogCheckpointNeeded(XLogSegNo new_segno);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, XLogSegNo max_segno,
bool use_lock);
+static char *XLogFileMap(XLogSegNo segno, bool *is_pmem);
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notfoundOk);
static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -891,6 +883,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
+static void XLogFileUnmap(char *pages, XLogSegNo segno);
static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
@@ -940,7 +933,6 @@ static void checkXLogConsistency(XLogReaderState *record);
static void WALInsertLockAcquire(void);
static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
-static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
/*
* Insert an XLOG record represented by an already-constructed chain of data
@@ -1574,27 +1566,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
*/
while (CurrPos < EndPos)
{
- /*
- * The minimal action to flush the page would be to call
- * WALInsertLockUpdateInsertingAt(CurrPos) followed by
- * AdvanceXLInsertBuffer(...). The page would be left initialized
- * mostly to zeros, except for the page header (always the short
- * variant, as this is never a segment's first page).
- *
- * The large vistas of zeros are good for compressibility, but the
- * headers interrupting them every XLOG_BLCKSZ (with values that
- * differ from page to page) are not. The effect varies with
- * compression tool, but bzip2 for instance compresses about an
- * order of magnitude worse if those headers are left in place.
- *
- * Rather than complicating AdvanceXLInsertBuffer itself (which is
- * called in heavily-loaded circumstances as well as this lightly-
- * loaded one) with variant behavior, we just use GetXLogBuffer
- * (which itself calls the two methods we need) to get the pointer
- * and zero most of the page. Then we just zero the page header.
- */
- currpos = GetXLogBuffer(CurrPos);
- MemSet(currpos, 0, SizeOfXLogShortPHD);
+ /* XXX We assume XLogFileInit has already done what we used to do here */
CurrPos += XLOG_BLCKSZ;
}
@@ -1708,29 +1680,6 @@ WALInsertLockRelease(void)
}
}
-/*
- * Update our insertingAt value, to let others know that we've finished
- * inserting up to that point.
- */
-static void
-WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
-{
- if (holdingAllLocks)
- {
- /*
- * We use the last lock to mark our actual position, see comments in
- * WALInsertLockAcquireExclusive.
- */
- LWLockUpdateVar(&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.lock,
- &WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.insertingAt,
- insertingAt);
- }
- else
- LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
- &WALInsertLocks[MyLockNo].l.insertingAt,
- insertingAt);
-}
-
/*
* Wait for any WAL insertions < upto to finish.
*
@@ -1831,123 +1780,37 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
/*
* Get a pointer to the right location in the WAL buffer containing the
* given XLogRecPtr.
- *
- * If the page is not initialized yet, it is initialized. That might require
- * evicting an old dirty buffer from the buffer cache, which means I/O.
- *
- * The caller must ensure that the page containing the requested location
- * isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto a WAL insertion lock with the insertingAt position set to
- * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
- * to evict an old page from the buffer. (This means that once you call
- * GetXLogBuffer() with a given 'ptr', you must not access anything before
- * that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
- * later, because older buffers might be recycled already)
*/
static char *
GetXLogBuffer(XLogRecPtr ptr)
{
- int idx;
- XLogRecPtr endptr;
- static uint64 cachedPage = 0;
- static char *cachedPos = NULL;
- XLogRecPtr expectedEndPtr;
+ int idx;
+ XLogPageHeader page;
+ XLogSegNo segno;
- /*
- * Fast path for the common case that we need to access again the same
- * page as last time.
- */
- if (ptr / XLOG_BLCKSZ == cachedPage)
+ /* silence compiler warning when built without --enable-cassert */
+ (void) page;
+
+ XLByteToSeg(ptr, segno, wal_segment_size);
+ if (segno != openLogSegNo)
{
- Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
- Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
- return cachedPos + ptr % XLOG_BLCKSZ;
+ /* Unmap the current segment if mapped */
+ if (mappedPages != NULL)
+ XLogFileUnmap(mappedPages, openLogSegNo);
+
+ /* Map the segment we need */
+ mappedPages = XLogFileMap(segno, &pmemMapped);
+ Assert(mappedPages != NULL);
+ openLogSegNo = segno;
}
- /*
- * The XLog buffer cache is organized so that a page is always loaded to a
- * particular buffer. That way we can easily calculate the buffer a given
- * page must be loaded into, from the XLogRecPtr alone.
- */
idx = XLogRecPtrToBufIdx(ptr);
+ page = (XLogPageHeader) (mappedPages + idx * (Size) XLOG_BLCKSZ);
- /*
- * See what page is loaded in the buffer at the moment. It could be the
- * page we're looking for, or something older. It can't be anything newer
- * - that would imply the page we're looking for has already been written
- * out to disk and evicted, and the caller is responsible for making sure
- * that doesn't happen.
- *
- * However, we don't hold a lock while we read the value. If someone has
- * just initialized the page, it's possible that we get a "torn read" of
- * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
- * that case we will see a bogus value. That's ok, we'll grab the mapping
- * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
- * the page we're looking for. But it means that when we do this unlocked
- * read, we might see a value that appears to be ahead of the page we're
- * looking for. Don't PANIC on that, until we've verified the value while
- * holding the lock.
- */
- expectedEndPtr = ptr;
- expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+ Assert(page->xlp_magic == XLOG_PAGE_MAGIC);
+ Assert(page->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
- endptr = XLogCtl->xlblocks[idx];
- if (expectedEndPtr != endptr)
- {
- XLogRecPtr initializedUpto;
-
- /*
- * Before calling AdvanceXLInsertBuffer(), which can block, let others
- * know how far we're finished with inserting the record.
- *
- * NB: If 'ptr' points to just after the page header, advertise a
- * position at the beginning of the page rather than 'ptr' itself. If
- * there are no other insertions running, someone might try to flush
- * up to our advertised location. If we advertised a position after
- * the page header, someone might try to flush the page header, even
- * though page might actually not be initialized yet. As the first
- * inserter on the page, we are effectively responsible for making
- * sure that it's initialized, before we let insertingAt to move past
- * the page header.
- */
- if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
- XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
- initializedUpto = ptr - SizeOfXLogShortPHD;
- else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
- XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
- initializedUpto = ptr - SizeOfXLogLongPHD;
- else
- initializedUpto = ptr;
-
- WALInsertLockUpdateInsertingAt(initializedUpto);
-
- AdvanceXLInsertBuffer(ptr, false);
- endptr = XLogCtl->xlblocks[idx];
-
- if (expectedEndPtr != endptr)
- elog(PANIC, "could not find WAL buffer for %X/%X",
- (uint32) (ptr >> 32), (uint32) ptr);
- }
- else
- {
- /*
- * Make sure the initialization of the page is visible to us, and
- * won't arrive later to overwrite the WAL data we write on the page.
- */
- pg_memory_barrier();
- }
-
- /*
- * Found the buffer holding this page. Return a pointer to the right
- * offset within the page.
- */
- cachedPage = ptr / XLOG_BLCKSZ;
- cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
-
- Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
- Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-
- return cachedPos + ptr % XLOG_BLCKSZ;
+ return mappedPages + ptr % wal_segment_size;
}
/*
@@ -2075,178 +1938,6 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
return result;
}
-/*
- * Initialize XLOG buffers, writing out old buffers if they still contain
- * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
- * true, initialize as many pages as we can without having to write out
- * unwritten data. Any new pages are initialized to zeros, with pages headers
- * initialized properly.
- */
-static void
-AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
-{
- XLogCtlInsert *Insert = &XLogCtl->Insert;
- int nextidx;
- XLogRecPtr OldPageRqstPtr;
- XLogwrtRqst WriteRqst;
- XLogRecPtr NewPageEndPtr = InvalidXLogRecPtr;
- XLogRecPtr NewPageBeginPtr;
- XLogPageHeader NewPage;
- int npages = 0;
-
- LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
- /*
- * Now that we have the lock, check if someone initialized the page
- * already.
- */
- while (upto >= XLogCtl->InitializedUpTo || opportunistic)
- {
- nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
-
- /*
- * Get ending-offset of the buffer page we need to replace (this may
- * be zero if the buffer hasn't been used yet). Fall through if it's
- * already written out.
- */
- OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- if (LogwrtResult.Write < OldPageRqstPtr)
- {
- /*
- * Nope, got work to do. If we just want to pre-initialize as much
- * as we can without flushing, give up now.
- */
- if (opportunistic)
- break;
-
- /* Before waiting, get info_lck and update LogwrtResult */
- SpinLockAcquire(&XLogCtl->info_lck);
- if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
- XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
- LogwrtResult = XLogCtl->LogwrtResult;
- SpinLockRelease(&XLogCtl->info_lck);
-
- /*
- * Now that we have an up-to-date LogwrtResult value, see if we
- * still need to write it or if someone else already did.
- */
- if (LogwrtResult.Write < OldPageRqstPtr)
- {
- /*
- * Must acquire write lock. Release WALBufMappingLock first,
- * to make sure that all insertions that we need to wait for
- * can finish (up to this same position). Otherwise we risk
- * deadlock.
- */
- LWLockRelease(WALBufMappingLock);
-
- WaitXLogInsertionsToFinish(OldPageRqstPtr);
-
- LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
- LogwrtResult = XLogCtl->LogwrtResult;
- if (LogwrtResult.Write >= OldPageRqstPtr)
- {
- /* OK, someone wrote it already */
- LWLockRelease(WALWriteLock);
- }
- else
- {
- /* Have to write it ourselves */
- TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
- WriteRqst.Write = OldPageRqstPtr;
- WriteRqst.Flush = 0;
- XLogWrite(WriteRqst, false);
- LWLockRelease(WALWriteLock);
- TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
- }
- /* Re-acquire WALBufMappingLock and retry */
- LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
- continue;
- }
- }
-
- /*
- * Now the next buffer slot is free and we can set it up to be the
- * next output page.
- */
- NewPageBeginPtr = XLogCtl->InitializedUpTo;
- NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-
- Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
-
- NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
-
- /*
- * Be sure to re-zero the buffer so that bytes beyond what we've
- * written will look like zeroes and not valid XLOG records...
- */
- MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
-
- /*
- * Fill the new page's header
- */
- NewPage->xlp_magic = XLOG_PAGE_MAGIC;
-
- /* NewPage->xlp_info = 0; */ /* done by memset */
- NewPage->xlp_tli = ThisTimeLineID;
- NewPage->xlp_pageaddr = NewPageBeginPtr;
-
- /* NewPage->xlp_rem_len = 0; */ /* done by memset */
-
- /*
- * If online backup is not in progress, mark the header to indicate
- * that WAL records beginning in this page have removable backup
- * blocks. This allows the WAL archiver to know whether it is safe to
- * compress archived WAL data by transforming full-block records into
- * the non-full-block format. It is sufficient to record this at the
- * page level because we force a page switch (in fact a segment
- * switch) when starting a backup, so the flag will be off before any
- * records can be written during the backup. At the end of a backup,
- * the last page will be marked as all unsafe when perhaps only part
- * is unsafe, but at worst the archiver would miss the opportunity to
- * compress a few records.
- */
- if (!Insert->forcePageWrites)
- NewPage->xlp_info |= XLP_BKP_REMOVABLE;
-
- /*
- * If first page of an XLOG segment file, make it a long header.
- */
- if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
- {
- XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
-
- NewLongPage->xlp_sysid = ControlFile->system_identifier;
- NewLongPage->xlp_seg_size = wal_segment_size;
- NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
- NewPage->xlp_info |= XLP_LONG_HEADER;
- }
-
- /*
- * Make sure the initialization of the page becomes visible to others
- * before the xlblocks update. GetXLogBuffer() reads xlblocks without
- * holding a lock.
- */
- pg_write_barrier();
-
- *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
- XLogCtl->InitializedUpTo = NewPageEndPtr;
-
- npages++;
- }
- LWLockRelease(WALBufMappingLock);
-
-#ifdef WAL_DEBUG
- if (XLOG_DEBUG && npages > 0)
- {
- elog(DEBUG1, "initialized %d pages, up to %X/%X",
- npages, (uint32) (NewPageEndPtr >> 32), (uint32) NewPageEndPtr);
- }
-#endif
-}
-
/*
* Calculate CheckPointSegments based on max_wal_size_mb and
* checkpoint_completion_target.
@@ -2375,14 +2066,9 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
- bool ispartialpage;
- bool last_iteration;
bool finishing_seg;
- bool use_existent;
- int curridx;
- int npages;
- int startidx;
- uint32 startoffset;
+ XLogSegNo rqstLogSegNo;
+ XLogSegNo segno;
/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
@@ -2392,223 +2078,140 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
*/
LogwrtResult = XLogCtl->LogwrtResult;
- /*
- * Since successive pages in the xlog cache are consecutively allocated,
- * we can usually gather multiple pages together and issue just one
- * write() call. npages is the number of pages we have determined can be
- * written together; startidx is the cache block index of the first one,
- * and startoffset is the file offset at which it should go. The latter
- * two variables are only valid when npages > 0, but we must initialize
- * all of them to keep the compiler quiet.
- */
- npages = 0;
- startidx = 0;
- startoffset = 0;
+ /* Fast return if not requested to flush */
+ if (WriteRqst.Flush == 0)
+ return;
+ Assert(WriteRqst.Flush == WriteRqst.Write);
/*
- * Within the loop, curridx is the cache block index of the page to
- * consider writing. Begin at the buffer containing the next unwritten
- * page, or last partially written page.
+ * Call pmem_persist() or pmem_msync() for each segment file that contains
+ * records to be flushed.
*/
- curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);
-
- while (LogwrtResult.Write < WriteRqst.Write)
+ XLByteToPrevSeg(WriteRqst.Flush, rqstLogSegNo, wal_segment_size);
+ XLByteToSeg(LogwrtResult.Flush, segno, wal_segment_size);
+ while (segno <= rqstLogSegNo)
{
- /*
- * Make sure we're not ahead of the insert process. This could happen
- * if we're passed a bogus WriteRqst.Write that is past the end of the
- * last page that's been initialized by AdvanceXLInsertBuffer.
- */
- XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
+ bool is_pmem;
+ char *addr;
+ char *p;
+ Size len;
+ XLogRecPtr BeginPtr;
+ XLogRecPtr EndPtr;
- if (LogwrtResult.Write >= EndPtr)
- elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
- (uint32) (LogwrtResult.Write >> 32),
- (uint32) LogwrtResult.Write,
- (uint32) (EndPtr >> 32), (uint32) EndPtr);
-
- /* Advance LogwrtResult.Write to end of current buffer page */
- LogwrtResult.Write = EndPtr;
- ispartialpage = WriteRqst.Write < LogwrtResult.Write;
-
- if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size))
+ /* Check if the segment is not mapped yet */
+ if (segno != openLogSegNo)
{
+ /* Not mapped yet; map it now */
+ is_pmem = false;
+ addr = XLogFileMap(segno, &is_pmem);
+
/*
- * Switch to new logfile segment. We cannot have any pending
- * pages here (since we dump what we have at segment end).
+ * Keep the mapping just made as this process's WAL buffer for
+ * future use. Note that it might be unmapped within this loop.
*/
- Assert(npages == 0);
- if (openLogFile >= 0)
- XLogFileClose();
- XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size);
-
- /* create/use new log file */
- use_existent = true;
- openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+ if (openLogSegNo == 0)
+ {
+ pmemMapped = is_pmem;
+ mappedPages = addr;
+ openLogSegNo = segno;
+ }
}
-
- /* Make sure we have the current logfile open */
- if (openLogFile < 0)
+ else
{
- XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size);
- openLogFile = XLogFileOpen(openLogSegNo);
+ /* Or reuse the existing mapping */
+ is_pmem = pmemMapped;
+ addr = mappedPages;
}
+ Assert(addr != NULL);
+ Assert(mappedPages != NULL);
+ Assert(openLogSegNo > 0);
- /* Add current page to the set of pending pages-to-dump */
- if (npages == 0)
- {
- /* first of group */
- startidx = curridx;
- startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
- wal_segment_size);
- }
- npages++;
+ /* Find beginning position to be flushed */
+ BeginPtr = segno * wal_segment_size;
+ if (BeginPtr < LogwrtResult.Flush)
+ BeginPtr = LogwrtResult.Flush;
+
+ /* Find ending position to be flushed */
+ EndPtr = (segno + 1) * wal_segment_size;
+ if (EndPtr > WriteRqst.Flush)
+ EndPtr = WriteRqst.Flush;
+
+ /* Convert LSN to memory address */
+ Assert(BeginPtr <= EndPtr);
+ p = addr + BeginPtr % wal_segment_size;
+ len = (Size) (EndPtr - BeginPtr);
/*
- * Dump the set if this will be the last loop iteration, or if we are
- * at the last page of the cache area (since the next page won't be
- * contiguous in memory), or if we are at the end of the logfile
- * segment.
+ * Do cache-flush or msync.
+ *
+ * Note that pmem_msync() rounds the range down to a page boundary.
*/
- last_iteration = WriteRqst.Write <= LogwrtResult.Write;
-
- finishing_seg = !ispartialpage &&
- (startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;
-
- if (last_iteration ||
- curridx == XLogCtl->XLogCacheBlck ||
- finishing_seg)
+ if (is_pmem)
{
- char *from;
- Size nbytes;
- Size nleft;
- int written;
-
- /* OK to write the page(s) */
- from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
- nbytes = npages * (Size) XLOG_BLCKSZ;
- nleft = nbytes;
- do
+ pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+ pmem_persist(p, len);
+ pgstat_report_wait_end();
+ }
+ else
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+ if (pmem_msync(p, len))
{
- errno = 0;
- pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
- written = pg_pwrite(openLogFile, from, nleft, startoffset);
pgstat_report_wait_end();
- if (written <= 0)
- {
- if (errno == EINTR)
- continue;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not write to log file %s "
- "at offset %u, length %zu: %m",
- XLogFileNameP(ThisTimeLineID, openLogSegNo),
- startoffset, nleft)));
- }
- nleft -= written;
- from += written;
- startoffset += written;
- } while (nleft > 0);
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not msync file \"%s\": %m",
+ XLogFileNameP(ThisTimeLineID, segno))));
+ }
+ pgstat_report_wait_end();
+ }
+ LogwrtResult.Flush = LogwrtResult.Write = EndPtr;
- npages = 0;
+ /* Check whether the whole segment has been synchronized */
+ finishing_seg = (LogwrtResult.Flush % wal_segment_size == 0) &&
+ XLByteInPrevSeg(LogwrtResult.Flush, openLogSegNo,
+ wal_segment_size);
- /*
- * If we just wrote the whole last page of a logfile segment,
- * fsync the segment immediately. This avoids having to go back
- * and re-open prior segments when an fsync request comes along
- * later. Doing it here ensures that one and only one backend will
- * perform this fsync.
- *
- * This is also the right place to notify the Archiver that the
- * segment is ready to copy to archival storage, and to update the
- * timer for archive_timeout, and to signal for a checkpoint if
- * too many logfile segments have been used since the last
- * checkpoint.
- */
+ if (segno != openLogSegNo || finishing_seg)
+ {
+ XLogFileUnmap(addr, segno);
if (finishing_seg)
{
- issue_xlog_fsync(openLogFile, openLogSegNo);
-
- /* signal that we need to wakeup walsenders later */
- WalSndWakeupRequest();
-
- LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
-
- if (XLogArchivingActive())
- XLogArchiveNotifySeg(openLogSegNo);
-
- XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
- XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
-
- /*
- * Request a checkpoint if we've consumed too much xlog since
- * the last one. For speed, we first check using the local
- * copy of RedoRecPtr, which might be out of date; if it looks
- * like a checkpoint is needed, forcibly update RedoRecPtr and
- * recheck.
- */
- if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
- {
- (void) GetRedoRecPtr();
- if (XLogCheckpointNeeded(openLogSegNo))
- RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
- }
+ Assert(segno == openLogSegNo);
+ mappedPages = NULL;
+ openLogSegNo = 0;
}
- }
- if (ispartialpage)
- {
- /* Only asked to write a partial page */
- LogwrtResult.Write = WriteRqst.Write;
- break;
- }
- curridx = NextBufIdx(curridx);
+ /* signal that we need to wakeup walsenders later */
+ WalSndWakeupRequest();
- /* If flexible, break out of loop as soon as we wrote something */
- if (flexible && npages == 0)
- break;
- }
+ if (XLogArchivingActive())
+ XLogArchiveNotifySeg(segno);
- Assert(npages == 0);
+ XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+ XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
- /*
- * If asked to flush, do so
- */
- if (LogwrtResult.Flush < WriteRqst.Flush &&
- LogwrtResult.Flush < LogwrtResult.Write)
-
- {
- /*
- * Could get here without iterating above loop, in which case we might
- * have no open file or the wrong one. However, we do not need to
- * fsync more than one file.
- */
- if (sync_method != SYNC_METHOD_OPEN &&
- sync_method != SYNC_METHOD_OPEN_DSYNC)
- {
- if (openLogFile >= 0 &&
- !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size))
- XLogFileClose();
- if (openLogFile < 0)
+ /*
+ * Request a checkpoint if we've consumed too much xlog since
+ * the last one. For speed, we first check using the local
+ * copy of RedoRecPtr, which might be out of date; if it looks
+ * like a checkpoint is needed, forcibly update RedoRecPtr and
+ * recheck.
+ */
+ if (IsUnderPostmaster && XLogCheckpointNeeded(segno))
{
- XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size);
- openLogFile = XLogFileOpen(openLogSegNo);
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(segno))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
-
- issue_xlog_fsync(openLogFile, openLogSegNo);
}
- /* signal that we need to wakeup walsenders later */
- WalSndWakeupRequest();
-
- LogwrtResult.Flush = LogwrtResult.Write;
+ ++segno;
}
+ /* signal that we need to wakeup walsenders later */
+ WalSndWakeupRequest();
+
/*
* Update shared-memory status
*
@@ -3029,6 +2632,16 @@ XLogBackgroundFlush(void)
XLogFileClose();
}
}
+ else if (mappedPages != NULL)
+ {
+ if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
+ wal_segment_size))
+ {
+ XLogFileUnmap(mappedPages, openLogSegNo);
+ mappedPages = NULL;
+ openLogSegNo = 0;
+ }
+ }
return false;
}
@@ -3095,12 +2708,6 @@ XLogBackgroundFlush(void)
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests();
- /*
- * Great, done. To take some work off the critical path, try to initialize
- * as many of the no-longer-needed WAL buffers for future use as we can.
- */
- AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
-
/*
* If we determined that we need to write data, but somebody else
* wrote/flushed already, it should be considered as being active, to
@@ -3257,6 +2864,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
save_errno = 0;
if (wal_init_zero)
{
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ XLogPageHeader NewPage = (XLogPageHeader) zbuffer.data;
+ XLogRecPtr NewPageBeginPtr = logsegno * wal_segment_size;
+
/*
* Zero-fill the file. With this setting, we do this the hard way to
* ensure that all the file space has really been allocated. On
@@ -3268,6 +2879,48 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
*/
for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
{
+ memset(NewPage, 0, SizeOfXLogLongPHD);
+
+ /*
+ * Fill the new page's header
+ */
+ NewPage->xlp_magic = XLOG_PAGE_MAGIC;
+
+ /* NewPage->xlp_info = 0; */ /* done by memset */
+ NewPage->xlp_tli = ThisTimeLineID;
+ NewPage->xlp_pageaddr = NewPageBeginPtr;
+
+ /* NewPage->xlp_rem_len = 0; */ /* done by memset */
+
+ /*
+ * If online backup is not in progress, mark the header to indicate
+ * that WAL records beginning in this page have removable backup
+ * blocks. This allows the WAL archiver to know whether it is safe to
+ * compress archived WAL data by transforming full-block records into
+ * the non-full-block format. It is sufficient to record this at the
+ * page level because we force a page switch (in fact a segment
+ * switch) when starting a backup, so the flag will be off before any
+ * records can be written during the backup. At the end of a backup,
+ * the last page will be marked as all unsafe when perhaps only part
+ * is unsafe, but at worst the archiver would miss the opportunity to
+ * compress a few records.
+ */
+ if (!Insert->forcePageWrites)
+ NewPage->xlp_info |= XLP_BKP_REMOVABLE;
+
+ /*
+ * If first page of an XLOG segment file, make it a long header.
+ */
+ if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+ {
+ XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
+
+ NewLongPage->xlp_sysid = ControlFile->system_identifier;
+ NewLongPage->xlp_seg_size = wal_segment_size;
+ NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
+ NewPage->xlp_info |= XLP_LONG_HEADER;
+ }
+
errno = 0;
if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
@@ -3275,6 +2928,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
save_errno = errno ? errno : ENOSPC;
break;
}
+
+ NewPageBeginPtr += XLOG_BLCKSZ;
}
}
else
@@ -3610,6 +3265,36 @@ XLogFileOpen(XLogSegNo segno)
return fd;
}
+/*
+ * Memory-map a pre-existing logfile segment for WAL buffers.
+ *
+ * On success, returns a non-NULL address and sets is_pmem to indicate
+ * whether the file resides on PMEM. On failure, it PANICs.
+ */
+static char *
+XLogFileMap(XLogSegNo segno, bool *is_pmem)
+{
+ char path[MAXPGPATH];
+ char *addr;
+ Size mlen;
+ int pmem;
+
+ XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
+
+ mlen = 0;
+ pmem = 0;
+ addr = pmem_map_file(path, 0, 0, 0, &mlen, &pmem);
+ if (addr == NULL)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not open or mmap file \"%s\": %m", path)));
+
+ Assert(mlen == wal_segment_size);
+
+ *is_pmem = (bool) pmem;
+ return addr;
+}
+
/*
* Open a logfile segment for reading (during recovery).
*
@@ -3799,6 +3484,21 @@ XLogFileClose(void)
openLogFile = -1;
}
+/*
+ * Unmap a logfile segment that was mapped as WAL buffer.
+ */
+static void
+XLogFileUnmap(char *pages, XLogSegNo segno)
+{
+ Assert(pages != NULL);
+
+ if (pmem_unmap(pages, wal_segment_size))
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not unmap file \"%s\": %m",
+ XLogFileNameP(ThisTimeLineID, segno))));
+}
+
/*
* Preallocate log files beyond the specified log endpoint.
*/
@@ -4947,12 +4647,6 @@ XLOGShmemSize(void)
/* WAL insertion locks, plus alignment */
size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
- /* xlblocks array */
- size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
- /* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
- /* and the buffers themselves */
- size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
/*
* Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5028,10 +4722,6 @@ XLOGShmemInit(void)
* needed here.
*/
allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
- XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
- memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
- allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
-
/* WAL insertion locks. Ensure they're aligned to the full padded size */
allocptr += sizeof(WALInsertLockPadded) -
@@ -5048,15 +4738,6 @@ XLOGShmemInit(void)
WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
}
- /*
- * Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
- */
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
- XLogCtl->pages = allocptr;
- memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
-
/*
* Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
* in additional info.)
@@ -7494,40 +7175,12 @@ StartupXLOG(void)
Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
/*
- * Tricky point here: readBuf contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
+ * We DO NOT need the if-else block that once existed here, because we
+ * use WAL segment files as WAL buffers, so the last block is already
+ * "on the buffers."
+ *
+ * XXX We assume there is no torn record.
*/
- if (EndOfLog % XLOG_BLCKSZ != 0)
- {
- char *page;
- int len;
- int firstIdx;
- XLogRecPtr pageBeginPtr;
-
- pageBeginPtr = EndOfLog - (EndOfLog % XLOG_BLCKSZ);
- Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
-
- firstIdx = XLogRecPtrToBufIdx(EndOfLog);
-
- /* Copy the valid part of the last block, and zero the rest */
- page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
- len = EndOfLog % XLOG_BLCKSZ;
- memcpy(page, xlogreader->readBuf, len);
- memset(page + len, 0, XLOG_BLCKSZ - len);
-
- XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
- XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
- }
- else
- {
- /*
- * There is no partial block to copy. Just set InitializedUpTo, and
- * let the first attempt to insert a log record to initialize the next
- * buffer.
- */
- XLogCtl->InitializedUpTo = EndOfLog;
- }
LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
--
2.20.1
Attachment: 0003-Lazy-unmap-WAL-segments.patch (application/octet-stream)
From 29b1954f4bba9ffd7e28fac1c8c4302dfe4bc2a6 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:14 +0900
Subject: [msync 3/5] Lazy-unmap WAL segments
---
src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 43f9a8affc..317816a0b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -768,7 +768,9 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
*/
static int openLogFile = -1;
static XLogSegNo openLogSegNo = 0;
+static XLogSegNo beingClosedLogSegNo = 0;
static char *mappedPages = NULL;
+static char *beingUnmappedPages = NULL;
static bool pmemMapped = 0;
/*
@@ -1162,6 +1164,14 @@ XLogInsertRecord(XLogRecData *rdata,
}
}
+ /* Lazy-unmap */
+ if (beingUnmappedPages != NULL)
+ {
+ XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+ beingUnmappedPages = NULL;
+ beingClosedLogSegNo = 0;
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
@@ -1794,9 +1804,23 @@ GetXLogBuffer(XLogRecPtr ptr)
XLByteToSeg(ptr, segno, wal_segment_size);
if (segno != openLogSegNo)
{
- /* Unmap the current segment if mapped */
+ /*
+ * We do not want to unmap the current segment here because we are in
+ * a critical section and unmapping is a time-consuming operation. So
+ * we just mark the segment to be unmapped later.
+ */
if (mappedPages != NULL)
- XLogFileUnmap(mappedPages, openLogSegNo);
+ {
+ /*
+ * If another segment is already pending unmap, there is no
+ * choice but to unmap it here.
+ */
+ if (beingUnmappedPages != NULL)
+ XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+
+ beingUnmappedPages = mappedPages;
+ beingClosedLogSegNo = openLogSegNo;
+ }
/* Map the segment we need */
mappedPages = XLogFileMap(segno, &pmemMapped);
--
2.20.1
Attachment: 0004-Speculative-map-WAL-segments.patch (application/octet-stream)
From a1e54ccba738cc339647bba0bafd7df7e92915c3 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:15 +0900
Subject: [msync 4/5] Speculative-map WAL segments
---
src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 317816a0b9..9b3caa63a4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -976,6 +976,8 @@ XLogInsertRecord(XLogRecData *rdata,
info == XLOG_SWITCH);
XLogRecPtr StartPos;
XLogRecPtr EndPos;
+ XLogRecPtr ProbablyInsertPos;
+ XLogSegNo ProbablyInsertSegNo;
bool prevDoPageWrites = doPageWrites;
/* we assume that all of the record header is in the first chunk */
@@ -985,6 +987,23 @@ XLogInsertRecord(XLogRecData *rdata,
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
+ /* Speculatively map a segment we probably need */
+ ProbablyInsertPos = GetInsertRecPtr();
+ XLByteToSeg(ProbablyInsertPos, ProbablyInsertSegNo, wal_segment_size);
+ if (ProbablyInsertSegNo != openLogSegNo)
+ {
+ if (mappedPages != NULL)
+ {
+ Assert(beingUnmappedPages == NULL);
+ Assert(beingClosedLogSegNo == 0);
+ beingUnmappedPages = mappedPages;
+ beingClosedLogSegNo = openLogSegNo;
+ }
+ mappedPages = XLogFileMap(ProbablyInsertSegNo, &pmemMapped);
+ Assert(mappedPages != NULL);
+ openLogSegNo = ProbablyInsertSegNo;
+ }
+
/*----------
*
* We have now done all the preparatory work we can without holding a
--
2.20.1
Attachment: 0005-Allocate-WAL-segments-to-utilize-hugepage.patch (application/octet-stream)
From 0e34f41ac611cf0f4e5bdc3428b71e0f81d33cb0 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:16 +0900
Subject: [msync 5/5] Allocate WAL segments to utilize hugepage
See also https://nvdimm.wiki.kernel.org/2mib_fs_dax
---
src/backend/access/transam/xlog.c | 17 +++++++++++++++--
1 file changed, 15 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9b3caa63a4..d3ef7bf6e5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2904,8 +2904,21 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
memset(zbuffer.data, 0, XLOG_BLCKSZ);
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
- save_errno = 0;
- if (wal_init_zero)
+
+ /*
+ * Allocate the file with posix_fallocate(3) to utilize hugepages and
+ * reduce page-fault overhead. Note that posix_fallocate(3) does not
+ * set errno on error; instead, it returns an error number directly.
+ */
+ save_errno = posix_fallocate(fd, 0, wal_segment_size);
+
+ if (save_errno)
+ {
+ /*
+ * Do nothing on error. Go to pgstat_report_wait_end().
+ */
+ }
+ else if (wal_init_zero)
{
XLogCtlInsert *Insert = &XLogCtl->Insert;
XLogPageHeader NewPage = (XLogPageHeader) zbuffer.data;
--
2.20.1
Dear hackers,
I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench. Roughly speaking, when I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0. VTune told me that the CPU time of memcpy() called by CopyXLogRecordToWAL() got larger than before. When I used *NVDIMM-N and ext4 with filesystem DAX* to store WAL, however, it achieved "not bad" performance compared with our previous patchset and non-volatile WAL buffer. The CPU time of each of XLogInsert() and XLogFlush() was reduced, as with non-volatile WAL buffer.
So I think mmap()-ing WAL segments as WAL buffers is not such a bad idea as long as we use PMEM, at least NVDIMM-N.
Excuse me, but for now I'd rather not give the exact performance numbers, because the mmap()-ing patchset is WIP, so there might be bugs that wrongfully "improve" or "degrade" performance. Also, to explain why the performance improved, we need to understand persistent memory programming and related features such as filesystem DAX, huge page faults, and WAL persistence with cache-flush and memory-barrier instructions. I'd talk about all the details at the appropriate time and place. (The conference, or here later...)
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Monday, February 10, 2020 6:30 PM
To: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>
Cc: 'pgsql-hackers@postgresql.org' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,
I made another WIP patchset to mmap WAL segments as WAL buffers. Note that this is not a non-volatile WAL
buffer patchset but its competitor. I am measuring and analyzing the performance of this patchset to compare
with my N.V.WAL buffer. Please wait a few more days for the result report...
Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Wednesday, January 29, 2020 6:00 AM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Heikki Linnakangas <hlinnaka@iki.fi>; pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
I think our concerns are roughly classified into two:
(1) Performance
(2) Consistency

And your "different concern" is rather about (2), I think.
Actually, I think it was mostly a performance concern (writes
triggering lots of reading) but there might be a consistency issue as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
Company
Menjo-san,
On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:
I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench. Roughly speaking, When I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0.
I apologize for not having any opinion on the patches themselves, but
let me point out that it's better to base these patches on HEAD
(master branch) than REL_12_0, because all new code is committed to
the master branch, whereas stable branches such as REL_12_0 only
receive bug fixes. Do you have any specific reason to be working on
REL_12_0?
Thanks,
Amit
Hello Amit,
I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch,
whereas stable branches such as REL_12_0 only receive bug fixes. Do you have any specific reason to be working
on REL_12_0?
Yes, because I think it's human-friendly for reproducing and discussing performance measurements. Of course I know that all newly accepted patches are merged into master's HEAD, not into stable branches and not even into release tags, so I'm aware I should rebase my patchset onto master sooner or later. However, if someone, including me, says that s/he applied my patchset to "master" and measured its performance, we have to pay attention to which commit "master" really points to. Although we have SHA-1 hashes to specify a commit, we should check whether that specific commit on master includes patches affecting performance, because master's HEAD gets new patches day by day. On the other hand, a release tag clearly points to a commit we all probably know. Also, we can more easily check the features and improvements by using release notes and user manuals.
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
Hello,
On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:
Thanks for clarifying. I see where you're coming from.
While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least two
numbers -- performance with a branch's HEAD without patch applied and
that with patch applied -- which can be enough in most cases to see
the difference the patch makes. Sure, the numbers might change on
each report, but that's fine, I'd think. If you continue to develop
against the stable branch, you might fail to notice the impact of
relevant developments in the master branch, even developments that
possibly require rethinking the architecture of your own changes,
although maybe that rarely occurs.
Thanks,
Amit
Hi,
On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:
I applied my patchset that mmap()-s WAL segments as WAL buffers to
refs/tags/REL_12_0, and measured and analyzed its performance with
pgbench. Roughly speaking, When I used *SSD and ext4* to store WAL,
it was "obviously worse" than the original REL_12_0. VTune told me
that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
larger than before.
FWIW, this might largely be because of page faults. In contrast to
before we wouldn't reuse the same pages (because they've been
munmap()/mmap()ed), so the first time they're touched, we'll incur page
faults. Did you try mmap()ing with MAP_POPULATE? It's probably also
worthwhile to try to use MAP_HUGETLB.
Still doubtful it's the right direction, but I'd rather have good
numbers to back me up :)
Greetings,
Andres Freund
Dear Amit,
Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
Dear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details will be shown later.
Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)
Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)
Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).
The improvement percentages in the s=1000 case look larger than in the s=50 case. I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new GUCs, nvwal_path and nvwal_size, are used only after the patch is applied
Steps
=====
For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result shown in the tables above.
(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes
pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
I gave no -b option to use the built-in "TPC-B (sort-of)" query.
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)
Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
Attachments:
v2-0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From db976d96affc0b120c79f6ac666fc4fc663b13d2 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:10:41 +0900
Subject: [PATCH v2 1/3] Support GUCs for external WAL buffer
To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size. Now postgres maps a file at that path onto memory to
use it as WAL buffer. Note that the buffer is still volatile for now.
---
configure | 262 ++++++++++++++++++
configure.in | 43 +++
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/nv_xlog_buffer.c | 95 +++++++
src/backend/access/transam/xlog.c | 164 ++++++++++-
src/backend/utils/misc/guc.c | 23 +-
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 95 ++++++-
src/include/access/nv_xlog_buffer.h | 71 +++++
src/include/access/xlog.h | 2 +
src/include/pg_config.h.in | 6 +
src/include/utils/guc.h | 4 +
12 files changed, 748 insertions(+), 22 deletions(-)
create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
create mode 100644 src/include/access/nv_xlog_buffer.h
diff --git a/configure b/configure
index 93ee4a2937..72ebaa525d 100755
--- a/configure
+++ b/configure
@@ -864,6 +864,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_nvwal
with_gnu_ld
enable_largefile
'
@@ -1566,6 +1567,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-nvwal use non-volatile WAL buffer (NVWAL)
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8307,6 +8309,203 @@ fi
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+ withval=$with_nvwal;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+ freebsd1*|freebsd2*) elf=no;;
+ freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ if test -z "$GREP"; then
+ ac_path_GREP_found=false
+ # Loop through the user's path and test for each of PROGNAME-LIST
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_prog in grep ggrep; do
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+ as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+ # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+ ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+ ac_count=0
+ $as_echo_n 0123456789 >"conftest.in"
+ while :
+ do
+ cat "conftest.in" "conftest.in" >"conftest.tmp"
+ mv "conftest.tmp" "conftest.in"
+ cp "conftest.in" "conftest.nl"
+ $as_echo 'GREP' >> "conftest.nl"
+ "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+ diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+ as_fn_arith $ac_count + 1 && ac_count=$as_val
+ if test $ac_count -gt ${ac_path_GREP_max-0}; then
+ # Best one so far, save it but keep looking for a better one
+ ac_cv_path_GREP="$ac_path_GREP"
+ ac_path_GREP_max=$ac_count
+ fi
+ # 10*(2^10) chars as input seems more than enough
+ test $ac_count -gt 10 && break
+ done
+ rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+ $ac_path_GREP_found && break 3
+ done
+ done
+ done
+IFS=$as_save_IFS
+ if test -z "$ac_cv_path_GREP"; then
+ as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+ fi
+else
+ ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+ then ac_cv_path_EGREP="$GREP -E"
+ else
+ if test -z "$EGREP"; then
+ ac_path_EGREP_found=false
+ # Loop through the user's path and test for each of PROGNAME-LIST
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_prog in egrep; do
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+ as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+ # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+ ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+ ac_count=0
+ $as_echo_n 0123456789 >"conftest.in"
+ while :
+ do
+ cat "conftest.in" "conftest.in" >"conftest.tmp"
+ mv "conftest.tmp" "conftest.in"
+ cp "conftest.in" "conftest.nl"
+ $as_echo 'EGREP' >> "conftest.nl"
+ "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+ diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+ as_fn_arith $ac_count + 1 && ac_count=$as_val
+ if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+ # Best one so far, save it but keep looking for a better one
+ ac_cv_path_EGREP="$ac_path_EGREP"
+ ac_path_EGREP_max=$ac_count
+ fi
+ # 10*(2^10) chars as input seems more than enough
+ test $ac_count -gt 10 && break
+ done
+ rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+ $ac_path_EGREP_found && break 3
+ done
+ done
+ done
+IFS=$as_save_IFS
+ if test -z "$ac_cv_path_EGREP"; then
+ as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+ fi
+else
+ ac_cv_path_EGREP=$EGREP
+fi
+
+ fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#if __ELF__
+ yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+ $EGREP "yes" >/dev/null 2>&1; then :
+ ELF_SYS=true
+else
+ if test "X$elf" = "Xyes" ; then
+ ELF_SYS=true
+else
+ ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
#
# Assignments
#
@@ -12664,6 +12863,57 @@ fi
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_pmem_pmem_map_file=yes
+else
+ ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+ LIBS="-lpmem $LIBS"
+
+else
+ as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
##
## Header files
@@ -13343,6 +13593,18 @@ fi
done
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
fi
if test "$PORTNAME" = "win32" ; then
diff --git a/configure.in b/configure.in
index e2ae4e2d3e..4b3f1b4c42 100644
--- a/configure.in
+++ b/configure.in
@@ -968,6 +968,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+ [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+ freebsd1*|freebsd2*) elf=no;;
+ freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+ yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+ ELF_SYS=true
+else
+ ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
#
# Assignments
#
@@ -1269,6 +1301,12 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ AC_CHECK_LIB(pmem, pmem_map_file, [],
+ [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
##
## Header files
@@ -1446,6 +1484,11 @@ elif test "$with_uuid" = ossp ; then
[AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
if test "$PORTNAME" = "win32" ; then
AC_CHECK_HEADERS(crtdefs.h)
fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
xlogfuncs.o \
xloginsert.o \
xlogreader.o \
- xlogutils.o
+ xlogutils.o \
+ nv_xlog_buffer.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ * PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps the non-volatile WAL buffer into shared memory.
+ *
+ * Returns the mapped address on success; PANICs and never returns otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+ void *addr;
+ size_t map_len = 0;
+ int is_pmem = 0;
+
+ Assert(fname != NULL);
+ Assert(fsize > 0);
+
+ if (IsBootstrapProcessingMode())
+ {
+ /*
+ * Create and map a new file if we are in bootstrap mode (typically
+ * executed by initdb).
+ */
+ addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+ pg_file_create_mode, &map_len, &is_pmem);
+ }
+ else
+ {
+ /*
+ * Map an existing file. The second argument (len) should be zero,
+ * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+ * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+ */
+ addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+ }
+
+ if (addr == NULL)
+ elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+ if (map_len != fsize)
+ elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+ "expected %zu; actual %zu",
+ fname, fsize, map_len);
+
+ if (!is_pmem)
+ elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+ fname);
+
+ /*
+	 * Assert page-boundary alignment (8KiB by default). This should hold
+	 * because PMDK aligns mappings to hugepage boundaries (2MiB or 1GiB on x64).
+ */
+ Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+ elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+ fname, addr, (char *) addr + map_len);
+ return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+ Assert(addr != NULL);
+
+ if (pmem_unmap(addr, fsize) < 0)
+ {
+ elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+ return;
+ }
+
+ elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4361568882..24aed4e76e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
#include "access/xloginsert.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
@@ -852,6 +853,12 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
+/* For non-volatile WAL buffer (NVWAL) */
+char *NvwalPath = NULL; /* a GUC parameter */
+int NvwalSizeMB = 1024; /* a direct GUC parameter */
+static Size NvwalSize = 0; /* an indirect GUC parameter */
+static bool NvwalAvail = false;
+
/* For WALInsertLockAcquire/Release functions */
static int MyLockNo = 0;
static bool holdingAllLocks = false;
@@ -4947,6 +4954,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+ Assert(!NvwalAvail);
+
+ if (**newval != '\0')
+ {
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path is not supported by this build (built without --with-nvwal)");
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+ /* true if not empty; false if empty */
+ NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the bounds only and DOES NOT check whether the size is a
+ * multiple of wal_segment_size, because the segment size (probably stored
+ * in the control file) has not been set properly yet at this point.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+ Size buf_size;
+ int64 npages;
+
+ Assert(*newval > 0);
+
+ buf_size = (Size) (*newval) * 1024 * 1024;
+ npages = (int64) buf_size / XLOG_BLCKSZ;
+ Assert(npages > 0);
+
+ if (npages > INT_MAX)
+ {
+ /* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages is too large; "
+ "buf_size %zu; XLOG_BLCKSZ %d",
+ *newval, buf_size, (int) XLOG_BLCKSZ);
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+ NvwalSize = (Size) newval * 1024 * 1024;
+}
+
/*
* Read the control file, set respective GUCs.
*
@@ -4975,13 +5052,49 @@ XLOGShmemSize(void)
{
Size size;
+ /*
+	 * If we use the non-volatile WAL buffer, we ignore the given wal_buffers.
+	 * Instead, we set it to a value derived from the size of the buffer file.
+	 * This must be done here because the xlblocks array size depends on it.
+ */
+ if (NvwalAvail)
+ {
+ char buf[32];
+ int64 npages;
+
+ Assert(NvwalSizeMB > 0);
+ Assert(NvwalSize > 0);
+ Assert(wal_segment_size > 0);
+ Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+ /*
+		 * Now we can finally check whether the size of the non-volatile WAL
+		 * buffer (nvwal_size) is a multiple of the WAL segment size.
+ *
+ * Note that NvwalSize has already been calculated in assign_nvwal_size.
+ */
+ if (NvwalSize % wal_segment_size != 0)
+ {
+ elog(PANIC,
+ "invalid value for nvwal_size (%dMB): "
+				 "it must be a multiple of the WAL segment size; "
+ "NvwalSize %zu; wal_segment_size %d",
+ NvwalSizeMB, NvwalSize, wal_segment_size);
+ }
+
+ npages = (int64) NvwalSize / XLOG_BLCKSZ;
+ Assert(npages > 0 && npages <= INT_MAX);
+
+ snprintf(buf, sizeof(buf), "%d", (int) npages);
+ SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+ }
/*
* If the value of wal_buffers is -1, use the preferred auto-tune value.
* This isn't an amazingly clean place to do this, but we must wait till
* NBuffers has received its final value, and must do it before using the
* value of XLOGbuffers to do anything important.
*/
- if (XLOGbuffers == -1)
+ else if (XLOGbuffers == -1)
{
char buf[32];
@@ -4997,10 +5110,13 @@ XLOGShmemSize(void)
size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
- /* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
- /* and the buffers themselves */
- size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ if (!NvwalAvail)
+ {
+ /* extra alignment padding for XLOG I/O buffers */
+ size = add_size(size, XLOG_BLCKSZ);
+ /* and the buffers themselves */
+ size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ }
/*
* Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5097,13 +5213,32 @@ XLOGShmemInit(void)
}
/*
- * Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+	 * Open and memory-map a file for the non-volatile XLOG buffer. PMDK will
+	 * align the start of the buffer to a 2-MiB boundary if the size of the
+	 * buffer is 4 MiB or larger.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
- XLogCtl->pages = allocptr;
- memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ if (NvwalAvail)
+ {
+ /* Logging and error-handling should be done in the function */
+ XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+ /*
+		 * Do not memset the non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it may contain records needed for recovery. We zero it at
+		 * checkpoint time, after recovery completes successfully.
+ */
+ }
+ else
+ {
+ /*
+ * Align the start of the page buffers to a full xlog block size
+ * boundary. This simplifies some calculations in XLOG insertion. It
+ * is also required for O_DIRECT.
+ */
+ allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ XLogCtl->pages = allocptr;
+ memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ }
/*
* Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8400,6 +8535,13 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
+
+ /*
+ * If we use non-volatile XLOG buffer, unmap it.
+ */
+ if (NvwalAvail)
+ UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 464f264d9a..4befd4d276 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2703,7 +2703,7 @@ static struct config_int ConfigureNamesInt[] =
GUC_UNIT_XBLOCKS
},
&XLOGbuffers,
- -1, -1, (INT_MAX / XLOG_BLCKSZ),
+ -1, -1, INT_MAX,
check_wal_buffers, NULL, NULL
},
@@ -3304,6 +3304,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, assign_tcp_user_timeout, show_tcp_user_timeout
},
+ {
+ {"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+ NULL,
+ GUC_UNIT_MB
+ },
+ &NvwalSizeMB,
+ 1024, 1, INT_MAX,
+ check_nvwal_size, assign_nvwal_size, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4330,6 +4341,16 @@ static struct config_string ConfigureNamesString[] =
check_backtrace_functions, assign_backtrace_functions, NULL
},
+ {
+ {"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+ NULL
+ },
+ &NvwalPath,
+ "",
+ check_nvwal_path, assign_nvwal_path, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e58e4788a8..0c23c4d26b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -224,6 +224,8 @@
#checkpoint_timeout = 5min # range 30s-1d
#max_wal_size = 1GB
#min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index a6577486ce..869f95915e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,7 +144,10 @@ static bool show_setting = false;
static bool data_checksums = false;
static char *xlog_dir = NULL;
static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
static int wal_segment_size_mb;
+static int nvwal_size_mb;
/* internal vars */
@@ -1103,14 +1106,78 @@ setup_config(void)
conflines = replace_token(conflines, "#port = 5432", repltok);
#endif
- /* set default max_wal_size and min_wal_size */
- snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
- pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
- conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
-
- snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
- pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
- conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ if (nvwal_path != NULL)
+ {
+ int nr_segs;
+
+ if (str_nvwal_size_mb == NULL)
+ nvwal_size_mb = 1024;
+ else
+ {
+ char *endptr;
+
+ /* check that the argument is a number */
+ nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+ /* verify that the size of non-volatile WAL buffer is valid */
+ if (endptr == str_nvwal_size_mb || *endptr != '\0')
+ {
+ pg_log_error("argument of --nvwal-size must be a number; "
+ "str_nvwal_size_mb '%s'",
+ str_nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb <= 0)
+ {
+ pg_log_error("argument of --nvwal-size must be a positive number; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb % wal_segment_size_mb != 0)
+ {
+				pg_log_error("argument of --nvwal-size must be a multiple of the WAL segment size; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+ exit(1);
+ }
+ }
+
+ /*
+		 * XXX We set {min_,max_,nv}wal_size to the same value. Note that
+		 * postgres might bootstrap and run even if the three settings do not
+		 * have the same value, but that has not been tested yet.
+ */
+ nr_segs = nvwal_size_mb / wal_segment_size_mb;
+
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+ nvwal_path);
+ conflines = replace_token(conflines,
+ "#nvwal_path = '/path/to/nvwal'", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+ }
+ else
+ {
+ /* set default max_wal_size and min_wal_size */
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ }
snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
escape_quotes(lc_messages));
@@ -2309,6 +2376,8 @@ usage(const char *progname)
printf(_(" -W, --pwprompt prompt for a password for the new superuser\n"));
printf(_(" -X, --waldir=WALDIR location for the write-ahead log directory\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
+ printf(_(" -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)\n"));
+ printf(_(" -Q, --nvwal-size=SIZE size of NVWAL, in megabytes\n"));
printf(_("\nLess commonly used options:\n"));
printf(_(" -d, --debug generate lots of debugging output\n"));
printf(_(" -k, --data-checksums use data page checksums\n"));
@@ -2982,6 +3051,8 @@ main(int argc, char *argv[])
{"sync-only", no_argument, NULL, 'S'},
{"waldir", required_argument, NULL, 'X'},
{"wal-segsize", required_argument, NULL, 12},
+ {"nvwal-path", required_argument, NULL, 'P'},
+ {"nvwal-size", required_argument, NULL, 'Q'},
{"data-checksums", no_argument, NULL, 'k'},
{"allow-group-access", no_argument, NULL, 'g'},
{NULL, 0, NULL, 0}
@@ -3025,7 +3096,7 @@ main(int argc, char *argv[])
/* process command-line options */
- while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+ while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
{
switch (c)
{
@@ -3119,6 +3190,12 @@ main(int argc, char *argv[])
case 12:
str_wal_segment_size_mb = pg_strdup(optarg);
break;
+ case 'P':
+ nvwal_path = pg_strdup(optarg);
+ break;
+ case 'Q':
+ str_nvwal_size_mb = pg_strdup(optarg);
+ break;
case 'g':
SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#define nv_memset_persist pmem_memset_persist
+#define nv_memcpy_nodrain pmem_memcpy_nodrain
+#define nv_flush pmem_flush
+#define nv_drain pmem_drain
+#define nv_persist pmem_persist
+
+#else
+/*
+ * Fallback no-op stubs for builds without --with-nvwal. They are static
+ * inline so that including this header from multiple translation units
+ * does not cause duplicate definitions at link time.
+ */
+static inline void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+static inline void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+	return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src, size_t len)
+{
+	return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+}
+
+static inline void
+nv_drain(void)
+{
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+}
+
+#endif /* USE_NVWAL */
+#endif /* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..174423901a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -129,6 +129,8 @@ extern int recoveryTargetAction;
extern int recovery_min_apply_delay;
extern char *PrimaryConnInfo;
extern char *PrimarySlotName;
+extern char *NvwalPath;
+extern int NvwalSizeMB;
/* indirectly set via GUC system */
extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 4fa0f770aa..1b6fb49f76 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
/* Define to 1 if you have the `pam' library (-lpam). */
#undef HAVE_LIBPAM
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
/* Define if you have a function readline library */
#undef HAVE_LIBREADLINE
@@ -871,6 +874,9 @@
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
/* Define to build with OpenSSL support. (--with-openssl) */
#undef USE_OPENSSL
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..d4a345c7f0 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -437,6 +437,10 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
#endif /* GUC_H */
--
2.17.1
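
With patch 0001 applied, enabling the feature comes down to the two new GUCs it adds; a minimal postgresql.conf fragment might look like the following (the DAX-mounted path is only an example, and nvwal_size must be a multiple of wal_segment_size):

```
# NVWAL on a file in a DAX-mounted PMEM filesystem (path is an example)
nvwal_path = '/mnt/pmem0/pgsql/nvwal'
nvwal_size = 1GB        # must be a multiple of wal_segment_size
```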
From 39d2f4e1b11eef84e1f1be8e8ff4f2f22ba85a37 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:10:42 +0900
Subject: [PATCH v2 2/3] Non-volatile WAL buffer
Now external WAL buffer becomes non-volatile.
Bumps PG_CONTROL_VERSION.
---
src/backend/access/transam/xlog.c | 1033 ++++++++++++++++++++---
src/backend/access/transam/xlogreader.c | 24 +
src/bin/pg_controldata/pg_controldata.c | 3 +
src/include/access/xlog.h | 8 +
src/include/catalog/pg_control.h | 17 +-
5 files changed, 973 insertions(+), 112 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24aed4e76e..2c6861f77e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -643,6 +643,13 @@ typedef struct XLogCtlData
TimeLineID ThisTimeLineID;
TimeLineID PrevTimeLineID;
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * All the records up to this LSN are persistent in NVWAL.
+ */
+ XLogRecPtr persistentUpTo;
+
/*
* SharedRecoveryInProgress indicates if we're still in crash or archive
* recovery. Protected by info_lck.
@@ -766,11 +773,12 @@ typedef enum
XLOG_FROM_ANY = 0, /* request to read WAL from any source */
XLOG_FROM_ARCHIVE, /* restored using restore_command */
XLOG_FROM_PG_WAL, /* existing file in pg_wal */
+ XLOG_FROM_NVWAL, /* non-volatile WAL buffer */
XLOG_FROM_STREAM /* streamed from master */
} XLogSource;
/* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream"};
/*
* openLogFile is -1 or a kernel FD for an open log file segment.
@@ -901,6 +909,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1181,6 +1190,43 @@ XLogInsertRecord(XLogRecData *rdata,
}
}
+ /*
+ * Request a checkpoint here if non-volatile WAL buffer is used and we
+ * have consumed too much WAL since the last checkpoint.
+ *
+ * We first screen under the condition (1) OR (2) below:
+ *
+ * (1) The record was the first one in a certain segment.
+ * (2) The record was inserted across segments.
+ *
+ * We then check the segment number which the record was inserted into.
+ */
+ if (NvwalAvail && inserted &&
+ (StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+ StartPos / wal_segment_size < EndPos / wal_segment_size))
+ {
+ XLogSegNo end_segno;
+
+ XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+ /*
+		 * NOTE: We do not signal walsender here because the inserted
+		 * records have not been drained from the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal the archiver here because the inserted
+		 * records have not been flushed to a segment file, so we don't
+		 * need to update XLogCtl->lastSegSwitch{Time,LSN}, which is used
+		 * only by CheckArchiveTimeout.
+ */
+
+ /* Two-step checking for speed (see also XLogWrite) */
+ if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(end_segno))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
@@ -2105,6 +2151,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
XLogRecPtr NewPageBeginPtr;
XLogPageHeader NewPage;
int npages = 0;
+ bool is_firstpage;
+
+ if (NvwalAvail)
+ elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo,
+ (uint32) (upto >> 32),
+ (uint32) upto,
+ opportunistic ? "true" : "false");
LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
@@ -2166,7 +2221,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
{
/* Have to write it ourselves */
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
- WriteRqst.Write = OldPageRqstPtr;
+
+ if (NvwalAvail)
+ {
+ /*
+ * If we use non-volatile WAL buffer, it is a special
+ * but expected case to write the buffer pages out to
+						 * segment files; for simplicity, this is done
+						 * segment by segment.
+ */
+ XLogRecPtr OldSegEndPtr;
+
+ OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+ Assert(OldSegEndPtr % wal_segment_size == 0);
+
+ WriteRqst.Write = OldSegEndPtr;
+ }
+ else
+ WriteRqst.Write = OldPageRqstPtr;
+
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, false);
LWLockRelease(WALWriteLock);
@@ -2193,7 +2266,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* Be sure to re-zero the buffer so that bytes beyond what we've
* written will look like zeroes and not valid XLOG records...
*/
- MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+ if (NvwalAvail)
+ {
+ /*
+		 * We do not combine MemSet() and pmem_persist() here because
+		 * pmem_persist() may fall back to a slow, strongly-ordered cache
+		 * flush instruction if a weakly-ordered fast one is not supported.
+		 * Instead, we first zero-fill the buffer with pmem_memset_persist(),
+		 * which can leverage fast non-temporal store instructions, and make
+		 * the header persistent later.
+ */
+ nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+ }
+ else
+ MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
/*
* Fill the new page's header
@@ -2225,7 +2311,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
/*
* If first page of an XLOG segment file, make it a long header.
*/
- if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+ is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+ if (is_firstpage)
{
XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
@@ -2240,7 +2327,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* before the xlblocks update. GetXLogBuffer() reads xlblocks without
* holding a lock.
*/
- pg_write_barrier();
+ if (NvwalAvail)
+ {
+ /* Make the header persistent on PMEM */
+ nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+ }
+ else
+ pg_write_barrier();
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
@@ -2250,6 +2343,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
}
LWLockRelease(WALBufMappingLock);
+ if (NvwalAvail)
+ elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo,
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo);
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG && npages > 0)
{
@@ -2631,6 +2731,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
LogwrtResult.Flush = LogwrtResult.Write;
}
+ /*
+ * Update discardedUpTo if NVWAL is used. A new value should not fall
+ * behind the old one.
+ */
+ if (NvwalAvail)
+ {
+ Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (ControlFile->discardedUpTo < LogwrtResult.Write)
+ {
+ ControlFile->discardedUpTo = LogwrtResult.Write;
+ UpdateControlFile();
+ }
+ LWLockRelease(ControlFileLock);
+ }
+
/*
* Update shared-memory status
*
@@ -2835,6 +2952,123 @@ XLogFlush(XLogRecPtr record)
return;
}
+ if (NvwalAvail)
+ {
+ XLogRecPtr FromPos;
+
+ /*
+		 * No page on the NVWAL needs to be flushed to segment files.
+		 * Instead, we wait until all the insertions preceding this one
+		 * complete. We will wait below for all the records to become
+		 * persistent on the NVWAL.
+ */
+ record = WaitXLogInsertionsToFinish(record);
+
+ /*
+ * Check if another backend already have done what I am doing.
+ *
+ * We can compare something <= XLogCtl->persistentUpTo without
+ * holding XLogCtl->info_lck spinlock because persistentUpTo is
+ * monotonically increasing and can be loaded atomically on each
+ * NVWAL-supported platform (now x64 only).
+ */
+ FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+ if (record <= FromPos)
+ return;
+
+ /*
+		 * In a very rare case, we have wrapped around the whole NVWAL. We
+		 * do not need to care about old pages here because they have
+		 * already been evicted to segment files at record insertion.
+		 *
+		 * In such a case, we flush the whole NVWAL. We also log it as a
+		 * warning because it can be a time-consuming operation.
+ *
+ * TODO Advance XLogCtl->persistentUpTo at the end of XLogWrite, and
+ * we can remove the following first if-block.
+ */
+ if (record - FromPos > NvwalSize)
+ {
+ elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+ (uint32) (FromPos >> 32), (uint32) FromPos,
+ (uint32) (record >> 32), (uint32) record);
+
+ nv_flush(XLogCtl->pages, NvwalSize);
+ }
+ else
+ {
+ char *frompos;
+ char *uptopos;
+ size_t fromoff;
+ size_t uptooff;
+
+ /*
+ * Flush each record that is probably not flushed yet.
+ *
+ * There are two reasons why we say "probably". First, a record
+ * copied with a non-temporal store instruction has effectively been
+ * "flushed" already, but we cannot tell such a record apart;
+ * nv_flush is harmless to it with respect to consistency.
+ *
+ * Second, the target record might have already been evicted to a
+ * segment file by now. In this case, too, nv_flush is harmless with
+ * respect to consistency.
+ */
+ uptooff = record % NvwalSize;
+ uptopos = XLogCtl->pages + uptooff;
+ fromoff = FromPos % NvwalSize;
+ frompos = XLogCtl->pages + fromoff;
+
+ /* Handle buffer wrap-around */
+ if (uptopos <= frompos)
+ {
+ nv_flush(frompos, NvwalSize - fromoff);
+ fromoff = 0;
+ frompos = XLogCtl->pages;
+ }
+
+ nv_flush(frompos, uptooff - fromoff);
+ }
+
+ /*
+ * To guarantee durability ("D" of ACID), we should satisfy the
+ * following two for each transaction X:
+ *
+ * (1) All the WAL records inserted by X, including the commit record
+ * of X, should persist on NVWAL before the server commits X.
+ *
+ * (2) All the WAL records inserted by transactions other than X
+ * that have a smaller LSN than the commit record just inserted
+ * by X should persist on NVWAL before the server commits X.
+ *
+ * (1) can be satisfied by a store barrier after the commit record of
+ * X is flushed, because each WAL record of X is already flushed at
+ * the end of its insertion. (2) can be satisfied by waiting for any
+ * record insertions that have a smaller LSN than the commit record
+ * just inserted by X, and by a store barrier as well.
+ *
+ * Now is the time. Have a store barrier.
+ */
+ nv_drain();
+
+ /*
+ * Remember where the last persistent record is. A new value should
+ * not fall behind the old one.
+ */
+ SpinLockAcquire(&XLogCtl->info_lck);
+ if (XLogCtl->persistentUpTo < record)
+ XLogCtl->persistentUpTo = record;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ /*
+ * The records up to the returned "record" are now persistent on
+ * NVWAL. Now signal walsenders.
+ */
+ WalSndWakeupRequest();
+ WalSndWakeupProcessRequests();
+
+ return;
+ }
+
/* Quick exit if already known flushed */
if (record <= LogwrtResult.Flush)
return;
@@ -3018,6 +3252,13 @@ XLogBackgroundFlush(void)
if (RecoveryInProgress())
return false;
+ /*
+ * Quick exit if NVWAL buffer is used and archiving is not active. In this
+ * case, we need no WAL segment files in the pg_wal directory.
+ */
+ if (NvwalAvail && !XLogArchivingActive())
+ return false;
+
/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
@@ -3036,6 +3277,18 @@ XLogBackgroundFlush(void)
flexible = false; /* ensure it all gets written */
}
+ /*
+ * If NVWAL is used, back off to the last completed segment boundary
+ * so that buffer pages are written out to files segment by segment.
+ * We do this only here, after XLogCtl->asyncXactLSN has been loaded,
+ * because that value should be taken into account.
+ */
+ if (NvwalAvail)
+ {
+ WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+ flexible = false; /* ensure it all gets written */
+ }
+
/*
* If already known flushed, we're done. Just need to check if we are
* holding an open file handle to a logfile that's no longer in use,
@@ -3062,7 +3315,12 @@ XLogBackgroundFlush(void)
flushbytes =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
- if (WalWriterFlushAfter == 0 || lastflush == 0)
+ if (NvwalAvail)
+ {
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else if (WalWriterFlushAfter == 0 || lastflush == 0)
{
/* first call, or block based limits disabled */
WriteRqst.Flush = WriteRqst.Write;
@@ -3121,7 +3379,28 @@ XLogBackgroundFlush(void)
* Great, done. To take some work off the critical path, try to initialize
* as many of the no-longer-needed WAL buffers for future use as we can.
*/
- AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+ if (NvwalAvail && max_wal_senders == 0)
+ {
+ XLogRecPtr upto;
+
+ /*
+ * If NVWAL is used and there is no walsender, nobody will load
+ * segments from the buffer. So let's recycle segments up to {where
+ * we have requested to write and flush} + NvwalSize.
+ *
+ * Note that if NVWAL is used and a walsender seems to be running, we
+ * must do nothing; keep the written pages on the buffer so that
+ * walsenders load them from the buffer, not from the segment files.
+ * The buffer pages will eventually be recycled by checkpoint.
+ */
+ Assert(WriteRqst.Write == WriteRqst.Flush);
+ Assert(WriteRqst.Write % wal_segment_size == 0);
+
+ upto = WriteRqst.Write + NvwalSize;
+ AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+ }
+ else
+ AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
/*
* If we determined that we need to write data, but somebody else
@@ -3829,6 +4108,43 @@ XLogFileClose(void)
ReleaseExternalFD();
}
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+ XLogRecPtr newupto,
+ InitializedUpTo;
+
+ Assert(NvwalAvail);
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ newupto = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ InitializedUpTo = XLogCtl->InitializedUpTo;
+
+ newupto += NvwalSize;
+ Assert(newupto % wal_segment_size == 0);
+
+ if (newupto <= InitializedUpTo)
+ return;
+
+ /*
+ * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+ * handles the first argument as the beginning of pages, not the end.
+ */
+ AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
/*
* Preallocate log files beyond the specified log endpoint.
*
@@ -4124,8 +4440,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
* Before deleting the file, see if it can be recycled as a future log
* segment. Only recycle normal files, pg_standby for example can create
* symbolic links pointing to a separate archive directory.
+ *
+ * If the NVWAL buffer is used, a log segment file is never recycled
+ * (that is, we always go into the else block).
*/
- if (wal_recycle &&
+ if (!NvwalAvail && wal_recycle &&
endlogSegNo <= recycleSegNo &&
lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
@@ -4533,6 +4852,7 @@ InitControlFile(uint64 sysidentifier)
memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
ControlFile->state = DB_SHUTDOWNED;
ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+ ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
/* Set important parameter values for use when replaying WAL */
ControlFile->MaxConnections = MaxConnections;
@@ -5365,41 +5685,58 @@ BootStrapXLOG(void)
record->xl_crc = crc;
/* Create first XLOG segment file */
- use_existent = false;
- openLogFile = XLogFileInit(1, &use_existent, false);
+ if (NvwalAvail)
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+ pgstat_report_wait_end();
- /*
- * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
- * close the file again in a moment.
- */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ nv_drain();
+ pgstat_report_wait_end();
- /* Write the first page with the initial record */
- errno = 0;
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
- if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
- {
- /* if write didn't set errno, assume problem is no disk space */
- if (errno == 0)
- errno = ENOSPC;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not write bootstrap write-ahead log file: %m")));
+ /*
+ * The rest of the WAL state will be initialized in the startup process.
+ */
}
- pgstat_report_wait_end();
+ else
+ {
+ use_existent = false;
+ openLogFile = XLogFileInit(1, &use_existent, false);
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
- if (pg_fsync(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync bootstrap write-ahead log file: %m")));
- pgstat_report_wait_end();
+ /*
+ * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+ * close the file again in a moment.
+ */
- if (close(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not close bootstrap write-ahead log file: %m")));
+ /* Write the first page with the initial record */
+ errno = 0;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write bootstrap write-ahead log file: %m")));
+ }
+ pgstat_report_wait_end();
- openLogFile = -1;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ if (pg_fsync(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync bootstrap write-ahead log file: %m")));
+ pgstat_report_wait_end();
+
+ if (close(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not close bootstrap write-ahead log file: %m")));
+
+ openLogFile = -1;
+ }
/* Now create pg_control */
InitControlFile(sysidentifier);
@@ -5653,41 +5990,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
* happens in the middle of a segment, copy data from the last WAL segment
* of the old timeline up to the switch point, to the starting WAL segment
* on the new timeline.
+ *
+ * If non-volatile WAL buffer is used, no new segment file is created. Data
+ * up to the switch point will be copied into the NVWAL buffer by StartupXLOG().
*/
- if (endLogSegNo == startLogSegNo)
+ if (!NvwalAvail)
{
- /*
- * Make a copy of the file on the new timeline.
- *
- * Writing WAL isn't allowed yet, so there are no locking
- * considerations. But we should be just as tense as XLogFileInit to
- * avoid emplacing a bogus file.
- */
- XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
- XLogSegmentOffset(endOfLog, wal_segment_size));
- }
- else
- {
- /*
- * The switch happened at a segment boundary, so just create the next
- * segment on the new timeline.
- */
- bool use_existent = true;
- int fd;
+ if (endLogSegNo == startLogSegNo)
+ {
+ /*
+ * Make a copy of the file on the new timeline.
+ *
+ * Writing WAL isn't allowed yet, so there are no locking
+ * considerations. But we should be just as tense as XLogFileInit to
+ * avoid emplacing a bogus file.
+ */
+ XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+ XLogSegmentOffset(endOfLog, wal_segment_size));
+ }
+ else
+ {
+ /*
+ * The switch happened at a segment boundary, so just create the next
+ * segment on the new timeline.
+ */
+ bool use_existent = true;
+ int fd;
- fd = XLogFileInit(startLogSegNo, &use_existent, true);
+ fd = XLogFileInit(startLogSegNo, &use_existent, true);
- if (close(fd) != 0)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
+ if (close(fd) != 0)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
- XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
- wal_segment_size);
- errno = save_errno;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not close file \"%s\": %m", xlogfname)));
+ XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+ wal_segment_size);
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m", xlogfname)));
+ }
}
}
@@ -6919,6 +7262,11 @@ StartupXLOG(void)
InRecovery = true;
}
+ /* Dump discardedUpTo just before REDO */
+ elog(LOG, "ControlFile->discardedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
+
/* REDO */
if (InRecovery)
{
@@ -7691,10 +8039,88 @@ StartupXLOG(void)
Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+ if (NvwalAvail)
+ {
+ XLogRecPtr discardedUpTo;
+
+ discardedUpTo = ControlFile->discardedUpTo;
+ Assert(discardedUpTo == InvalidXLogRecPtr ||
+ discardedUpTo % wal_segment_size == 0);
+
+ if (discardedUpTo == InvalidXLogRecPtr)
+ {
+ elog(DEBUG1, "brand-new NVWAL");
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else if (EndOfLog <= discardedUpTo)
+ {
+ elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = InvalidXLogRecPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else
+ {
+ int last_idx;
+ int idx;
+ XLogRecPtr ptr;
+
+ elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+ /*
+ * Initialize the xlblocks array because we decided to keep UNDONE
+ * records on the NVWAL buffer; otherwise, each buffer page whose
+ * xlblocks entry is 0 (initialized so by XLOGShmemInit) would be
+ * accidentally cleared by the following AdvanceXLInsertBuffer!
+ *
+ * Two cases can be considered:
+ *
+ * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+ * Initialize up to (and including) the page containing the last
+ * record. That page should end with EndOfLog. The next page
+ * "N", beginning with EndOfLog, is left untouched because, in
+ * the corner case where all the NVWAL buffer pages are already
+ * filled, page N occupies the same location as the first page
+ * "F" beginning with discardedUpTo. Of course we should not
+ * overwrite page F.
+ *
+ * In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+ * last_idx, indicating page N. Then, we go forward from page F
+ * up to (but excluding) page N, which has the same index as
+ * page F.
+ *
+ * 2) EndOfLog is not on a page boundary: Initialize all the pages
+ * except the page "L" containing the last record. Page L will
+ * be initialized by the following "Tricky point", including its
+ * content.
+ *
+ * In either case, XLogCtl->InitializedUpTo is to be initialized in
+ * the following "Tricky" if-else block.
+ */
+
+ last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+ ptr = discardedUpTo;
+ for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+ idx = NextBufIdx(idx))
+ {
+ ptr += XLOG_BLCKSZ;
+ XLogCtl->xlblocks[idx] = ptr;
+ }
+ }
+ }
+
/*
- * Tricky point here: readBuf contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
+ * Tricky point here: readBuf contains the *last* block that the
+ * LastRec record spans, not the one it starts in. The last block is
+ * indeed the one we want to use.
*/
if (EndOfLog % XLOG_BLCKSZ != 0)
{
@@ -7714,6 +8140,9 @@ StartupXLOG(void)
memcpy(page, xlogreader->readBuf, len);
memset(page + len, 0, XLOG_BLCKSZ - len);
+ if (NvwalAvail)
+ nv_persist(page, XLOG_BLCKSZ);
+
XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
}
@@ -7727,12 +8156,54 @@ StartupXLOG(void)
XLogCtl->InitializedUpTo = EndOfLog;
}
- LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+ if (NvwalAvail)
+ {
+ XLogRecPtr SegBeginPtr;
- XLogCtl->LogwrtResult = LogwrtResult;
+ /*
+ * If the NVWAL buffer is used, writing records out to segment files
+ * should be done segment by segment. So Logwrt{Rqst,Result} (and also
+ * discardedUpTo) should be a multiple of wal_segment_size. Let's
+ * back them off to the last segment boundary.
+ */
- XLogCtl->LogwrtRqst.Write = EndOfLog;
- XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+ /*
+ * persistentUpTo does not need to be a multiple of wal_segment_size;
+ * it should be the drained-up-to LSN. Walsenders will use it to load
+ * records from the NVWAL buffer.
+ */
+ XLogCtl->persistentUpTo = EndOfLog;
+
+ /* Update discardedUpTo in pg_control if still invalid */
+ if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+ {
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
+ elog(DEBUG1, "EndOfLog: %X/%X",
+ (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
+
+ elog(DEBUG1, "SegBeginPtr: %X/%X",
+ (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+ }
+ else
+ {
+ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ XLogCtl->LogwrtRqst.Write = EndOfLog;
+ XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ }
/*
* Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7863,6 +8334,7 @@ StartupXLOG(void)
char origpath[MAXPGPATH];
char partialfname[MAXFNAMELEN];
char partialpath[MAXPGPATH];
+ XLogRecPtr discardedUpTo;
XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7874,6 +8346,53 @@ StartupXLOG(void)
*/
XLogArchiveCleanup(partialfname);
+ /*
+ * If NVWAL is also used for archival recovery, write old
+ * records out to segment files to archive them. Note that we
+ * need the WAL-related locks because LocalXLogInsertAllowed
+ * has already been set to -1.
+ */
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo < EndOfLog)
+ {
+ XLogwrtRqst WriteRqst;
+ TimeLineID thisTLI = ThisTimeLineID;
+ XLogRecPtr SegBeginPtr =
+ EndOfLog - (EndOfLog % wal_segment_size);
+
+ /*
+ * XXX Assume that all the records have the same TLI.
+ */
+ ThisTimeLineID = EndOfLogTLI;
+
+ WriteRqst.Write = EndOfLog;
+ WriteRqst.Flush = 0;
+
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ XLogWrite(WriteRqst, false);
+
+ /*
+ * Force back-off to the last segment boundary.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ LWLockRelease(WALWriteLock);
+
+ ThisTimeLineID = thisTLI;
+ }
+
durable_rename(origpath, partialpath, ERROR);
XLogArchiveNotify(partialfname);
}
@@ -7883,7 +8402,10 @@ StartupXLOG(void)
/*
* Preallocate additional log files, if wanted.
*/
- PreallocXlogFiles(EndOfLog);
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(EndOfLog);
/*
* Okay, we're officially UP.
@@ -8428,10 +8950,24 @@ GetInsertRecPtr(void)
/*
* GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
* position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
*/
XLogRecPtr
GetFlushRecPtr(void)
{
+ if (NvwalAvail)
+ {
+ XLogRecPtr ret;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ LogwrtResult = XLogCtl->LogwrtResult;
+ ret = XLogCtl->persistentUpTo;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ return ret;
+ }
+
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
@@ -8731,6 +9267,9 @@ CreateCheckPoint(int flags)
VirtualTransactionId *vxids;
int nvxids;
+ /* for non-volatile WAL buffer */
+ XLogRecPtr newDiscardedUpTo = 0;
+
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
@@ -9042,6 +9581,22 @@ CreateCheckPoint(int flags)
*/
PriorRedoPtr = ControlFile->checkPointCopy.redo;
+ /*
+ * If the non-volatile WAL buffer is used, discardedUpTo should be updated
+ * and persisted in the control file, so the new value should be calculated
+ * here.
+ *
+ * TODO Avoid copy-and-pasted code...
+ */
+ if (NvwalAvail)
+ {
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ _logSegNo--;
+
+ newDiscardedUpTo = _logSegNo * wal_segment_size;
+ }
+
/*
* Update the control file.
*/
@@ -9050,6 +9605,16 @@ CreateCheckPoint(int flags)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
+ if (NvwalAvail)
+ {
+ /*
+ * A new value should not fall behind the old one.
+ */
+ if (ControlFile->discardedUpTo < newDiscardedUpTo)
+ ControlFile->discardedUpTo = newDiscardedUpTo;
+ else
+ newDiscardedUpTo = ControlFile->discardedUpTo;
+ }
ControlFile->time = (pg_time_t) time(NULL);
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9067,6 +9632,44 @@ CreateCheckPoint(int flags)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+ * so that the XLOG records older than newDiscardedUpTo are treated as
+ * "already written and flushed."
+ */
+ if (NvwalAvail)
+ {
+ Assert(newDiscardedUpTo > 0);
+
+ /* Update process-local variables */
+ LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+ /*
+ * Update shared-memory variables. We need both light-weight lock and
+ * spin lock to update them.
+ */
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&XLogCtl->info_lck);
+
+ /*
+ * Note that there is a corner case in which the process-local
+ * LogwrtResult falls behind the shared XLogCtl->LogwrtResult: the
+ * whole non-volatile XLOG buffer is filled and some pages are written out
+ * to segment files between UpdateControlFile and LWLockAcquire above.
+ *
+ * TODO For now, we ignore that case because it is very unlikely to occur.
+ */
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+ if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+ SpinLockRelease(&XLogCtl->info_lck);
+ LWLockRelease(WALWriteLock);
+ }
+
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptFullXid = checkPoint.nextFullXid;
@@ -9090,21 +9693,31 @@ CreateCheckPoint(int flags)
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
- /*
- * Delete old log files, those no longer needed for last checkpoint to
- * prevent the disk holding the xlog from growing full.
- */
- XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
- KeepLogSeg(recptr, &_logSegNo);
- _logSegNo--;
- RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ if (NvwalAvail)
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ else
+ {
+ /*
+ * Delete old log files, those no longer needed for last checkpoint to
+ * prevent the disk holding the xlog from growing full.
+ */
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ }
/*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
- PreallocXlogFiles(recptr);
+ {
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(recptr);
+ }
/*
* Truncate pg_subtrans if possible. We can throw away all data before
@@ -11751,6 +12364,116 @@ CancelBackup(void)
}
}
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+ return NvwalAvail;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets *nvwalptr to the LSN to load from.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+ XLogRecPtr readUpTo;
+ XLogRecPtr discardedUpTo;
+
+ Assert(IsNvwalAvail());
+ Assert(nvwalptr != NULL);
+
+ readUpTo = target + count;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check if all the records are on WAL segment files */
+ if (readUpTo <= discardedUpTo)
+ return 0;
+
+ /* Check if all the records are on NVWAL */
+ if (discardedUpTo <= target)
+ {
+ *nvwalptr = target;
+ return count;
+ }
+
+ /* Some on WAL segment files, some on NVWAL */
+ *nvwalptr = discardedUpTo;
+ return (Size) (readUpTo - discardedUpTo);
+}
+
+/*
+ * Like WALRead in xlogreader.c, but loads from the non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, XLogRecPtr startptr, Size count)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ Assert(NvwalAvail);
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ /*
+ * Hold WALBufMappingLock in shared mode to keep others from rotating
+ * the WAL buffer while we read WAL records from it. We do not need an
+ * exclusive lock
+ * because we will not rotate the buffer in this function.
+ */
+ LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+ while (nbytes > 0)
+ {
+ char *src;
+ Size off;
+ Size max_read;
+ Size readbytes;
+ XLogRecPtr discardedUpTo;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check whether the records we need have already been evicted */
+ if (recptr < discardedUpTo)
+ {
+ LWLockRelease(WALBufMappingLock);
+
+ /* TODO error handling? */
+ return false;
+ }
+
+ /*
+ * Get the target address on the non-volatile WAL buffer and the size
+ * we can load from it at once, because the buffer can wrap around and
+ * we might have to load what we want divided into two or more pieces.
+ */
+ off = recptr % NvwalSize;
+ src = XLogCtl->pages + off;
+ max_read = NvwalSize - off;
+ readbytes = (nbytes < max_read) ? nbytes : max_read;
+
+ memcpy(p, src, readbytes);
+
+ /* Update state for load */
+ recptr += readbytes;
+ nbytes -= readbytes;
+ p += readbytes;
+ }
+
+ LWLockRelease(WALBufMappingLock);
+ return true;
+}
+
/*
* Read the XLOG page containing RecPtr into readBuf (if not read already).
* Returns number of bytes read, if the page is read successfully, or -1
@@ -11818,7 +12541,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
retry:
/* See if we need to retrieve more data */
- if (readFile < 0 ||
+ if ((readSource != XLOG_FROM_NVWAL && readFile < 0) ||
(readSource == XLOG_FROM_STREAM &&
receivedUpto < targetPagePtr + reqLen))
{
@@ -11830,10 +12553,68 @@ retry:
if (readFile >= 0)
close(readFile);
readFile = -1;
- readLen = 0;
- readSource = 0;
- return -1;
+ /*
+ * Try the non-volatile WAL buffer as a last resort.
+ *
+ * XXX It is not supported yet in standby mode.
+ */
+ if (NvwalAvail && !StandbyMode && readSource != XLOG_FROM_STREAM)
+ {
+ XLogRecPtr discardedUpTo;
+
+ elog(DEBUG1, "see if NVWAL has records to be UNDONE");
+
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo <= targetPagePtr)
+ {
+ elog(DEBUG1, "recovering NVWAL");
+
+ /* Loading records from non-volatile WAL buffer */
+ currentSource = XLOG_FROM_NVWAL;
+ lastSourceFailed = false;
+
+ /* Report recovery progress in PS display */
+ set_ps_display("recovering NVWAL", false);
+
+ /* Track source of data */
+ readSource = XLOG_FROM_NVWAL;
+ XLogReceiptSource = XLOG_FROM_NVWAL;
+
+ /* Track receipt time */
+ XLogReceiptTime = GetCurrentTimestamp();
+
+ /*
+ * Construct expectedTLEs. This is necessary when recovering
+ * solely from NVWAL because its filename does not carry any
+ * TLI information.
+ */
+ if (!expectedTLEs)
+ {
+ TimeLineHistoryEntry *entry;
+
+ entry = (TimeLineHistoryEntry *) palloc(sizeof(TimeLineHistoryEntry));
+ entry->tli = recoveryTargetTLI;
+ entry->begin = entry->end = InvalidXLogRecPtr;
+
+ expectedTLEs = list_make1(entry);
+
+ elog(DEBUG1, "expectedTLEs: [%u]", (uint32) recoveryTargetTLI);
+ }
+ }
+ }
+ else
+ elog(DEBUG1, "do not recover NVWAL");
+
+ /* See if the try above succeeded or not */
+ if (readSource != XLOG_FROM_NVWAL)
+ {
+ readLen = 0;
+ readSource = 0;
+
+ return -1;
+ }
}
}
@@ -11841,7 +12622,7 @@ retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
- Assert(readFile != -1);
+ Assert(readFile != -1 || readSource == XLOG_FROM_NVWAL);
/*
* If the current segment is being streamed from master, calculate how
@@ -11860,41 +12641,60 @@ retry:
else
readLen = XLOG_BLCKSZ;
- /* Read the requested page */
readOff = targetPageOff;
- pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
- r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
- if (r != XLOG_BLCKSZ)
+ if (currentSource == XLOG_FROM_NVWAL)
{
- char fname[MAXFNAMELEN];
- int save_errno = errno;
+ Size offset = (Size) (targetPagePtr % NvwalSize);
+ char *readpos = XLogCtl->pages + offset;
+ Assert(readLen == XLOG_BLCKSZ);
+ Assert(offset % XLOG_BLCKSZ == 0);
+
+ /* Load the requested page from non-volatile WAL buffer */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ memcpy(readBuf, readpos, readLen);
pgstat_report_wait_end();
- XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
- if (r < 0)
+
+ /* There are not any other clues of TLI... */
+ xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+ }
+ else
+ {
+ /* Read the requested page from file */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+ if (r != XLOG_BLCKSZ)
{
- errno = save_errno;
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
- fname, readOff)));
+ char fname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ pgstat_report_wait_end();
+ XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+ if (r < 0)
+ {
+ errno = save_errno;
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode_for_file_access(),
+ errmsg("could not read from log segment %s, offset %u: %m",
+ fname, readOff)));
+ }
+ else
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+ fname, readOff, r, (Size) XLOG_BLCKSZ)));
+ goto next_record_is_invalid;
}
- else
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("could not read from log segment %s, offset %u: read %d of %zu",
- fname, readOff, r, (Size) XLOG_BLCKSZ)));
- goto next_record_is_invalid;
+ pgstat_report_wait_end();
+
+ xlogreader->seg.ws_tli = curFileTLI;
}
- pgstat_report_wait_end();
Assert(targetSegNo == readSegNo);
Assert(targetPageOff == readOff);
Assert(reqLen <= readLen);
- xlogreader->seg.ws_tli = curFileTLI;
-
/*
* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -11928,6 +12728,17 @@ retry:
goto next_record_is_invalid;
}
+ /*
+ * Update curFileTLI on each verified page if the non-volatile WAL buffer
+ * is used, because there is no TimeLineID information in the NVWAL's filename.
+ */
+ if (readSource == XLOG_FROM_NVWAL &&
+ curFileTLI != xlogreader->latestPageTLI)
+ {
+ curFileTLI = xlogreader->latestPageTLI;
+ elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+ }
+
return readLen;
next_record_is_invalid:
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f02256ed..c40a4f1400 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1058,11 +1058,24 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
char *p;
XLogRecPtr recptr;
Size nbytes;
+#ifndef FRONTEND
+ XLogRecPtr recptr_nvwal = 0;
+ Size nbytes_nvwal = 0;
+#endif
p = buf;
recptr = startptr;
nbytes = count;
+#ifndef FRONTEND
+ /* Try to load records directly from NVWAL if used */
+ if (IsNvwalAvail())
+ {
+ nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+ nbytes = count - nbytes_nvwal;
+ }
+#endif
+
while (nbytes > 0)
{
uint32 startoff;
@@ -1127,6 +1140,17 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
p += readbytes;
}
+#ifndef FRONTEND
+ if (IsNvwalAvail())
+ {
+ if (!CopyXLogRecordsFromNVWAL(p, recptr_nvwal, nbytes_nvwal))
+ {
+ /* TODO graceful error handling */
+ elog(PANIC, "some records on NVWAL had been discarded");
+ }
+ }
+#endif
+
return true;
}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..4c594e915f 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
ControlFile->checkPointCopy.oldestCommitTsXid);
printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
ControlFile->checkPointCopy.newestCommitTsXid);
+ printf(_("Discarded up to: %X/%X\n"),
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
printf(_("Time of latest checkpoint: %s\n"),
ckpttime_str);
printf(_("Fake LSN counter for unlogged rels: %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 174423901a..ccf2671bd9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -324,6 +324,14 @@ extern void XLogRequestWalReceiverReply(void);
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
+extern bool IsNvwalAvail(void);
+extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
+ Size count,
+ XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+ XLogRecPtr startptr,
+ Size count);
+
/*
* Routines to start, stop, and get status of a base backup.
*/
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..fe71992a69 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
/* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION 1300
+#define PG_CONTROL_VERSION 1301
/* Nonce key length, see below */
#define MOCK_AUTH_NONCE_LEN 32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels */
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+ * checkpoint or a restartpoint is completed successfully, or the whole
+ * NVWAL is filled with WAL records and a new record is being inserted.
+ * This field tells that the NVWAL contains WAL records in the range of
+ * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+ * Note that WAL records whose LSNs are less than discardedUpTo may still
+ * remain in WAL segment files and be needed for recovery.
+ *
+ * It is set to zero when NVWAL is not used.
+ */
+ XLogRecPtr discardedUpTo;
+
/*
* These two values determine the minimum point we must recover up to
* before starting up:
--
2.17.1
Attachment: v2-0003-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From 7a886ea7529b4d0e2273a13cd8d9209b652099c4 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:10:44 +0900
Subject: [PATCH v2 3/3] README for non-volatile WAL buffer
---
README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)
create mode 100644 README.nvwal
diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By putting the WAL buffer pages on persistent memory (PMEM)
+[1], inserting WAL records into them directly, and eliminating I/O for WAL
+segment files, PostgreSQL achieves lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+ * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+ * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+ * Linux: 4.15 or later (tested on 5.2)
+ * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+ * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I recommend installing under your home directory with the --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+ $ ./configure --with-nvwal --prefix="$HOME/postgres"
+ $ make
+ $ make install
+ $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use an NVDIMM-N or emulated PMEM, make an ext4 filesystem on
+namespace0.0 (/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget
+the "-o dax" option on mount. For Intel DCPMM and ipmctl, please see [4].
+
+ $ ndctl list
+ [
+ {
+ "dev":"namespace1.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem1",
+ "numa_node":1
+ },
+ {
+ "dev":"namespace0.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+ ]
+
+ $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+ {
+ "dev":"namespace0.0",
+ "mode":"fsdax",
+ "map":"dev",
+ "size":"94.50 GiB (101.47 GB)",
+ "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+
+ $ ls -l /dev/pmem0
+ brw-rw---- 1 root disk 259, 3 Jan 6 17:06 /dev/pmem0
+
+ $ sudo mkfs.ext4 -q -F /dev/pmem0
+ $ sudo mkdir -p /mnt/pmem0
+ $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+ $ mount -l | grep ^/dev/pmem0
+ /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge page
+----------------------------
+Transparent huge pages are generally not recommended for database workloads,
+but they improve PMEM performance by reducing page-walk overhead.
+
+ $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+ -rw-r--r-- 1 root root 4096 Dec 3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+ $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+ $ cat /sys/kernel/mm/transparent_hugepage/enabled
+ [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+ -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)
+ -Q, --nvwal-size=SIZE size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+ $ sudo mkdir -p /mnt/pmem0/pgsql
+ $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+ $ export PGDATA="$HOME/pgdata"
+ $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is okay; your NVWAL file holds the content of the first
+WAL segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+  segment size. The segment size is given with initdb --wal-segsize, and
+  defaults to 16MB.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+ which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+ above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+ exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+ not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please be careful to decide
+ how large your NVWAL file is to be.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters nvwal_path and nvwal_size, corresponding to the two
+new options of initdb. If you run initdb as above, you will find postgresql.conf
+in your PGDATA directory with entries like the following:
+
+ max_wal_size = 80GB
+ min_wal_size = 80GB
+ nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+ nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+ actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+ forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres might run even if the three values differ, but we
+  have not tested such a case yet.
+
+
+Startup
+-------
+Same as you know:
+
+ $ pg_ctl start
+
+or use numactl as follows to pin postgres to a specific NUMA node (typically
+the one holding your NVWAL file) if you need stable performance:
+
+ $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
--
2.17.1
Dear Andres,
Thank you for your advice about MAP_POPULATE flag. I rebased my msync patchset onto master and added a commit to append that flag
when mmap. A new v2 patchset is attached to this mail. Note that this patchset is NOT non-volatile WAL buffer's one.
I also measured performance of the following three versions, varying -c/--client and -j/--jobs options of pgbench, for each scaling
factor s = 50 or 1000.
- Before patchset (say "before")
- After patchset except patch 0005 not to use MAP_POPULATE ("after (no populate)")
- After full patchset to use MAP_POPULATE ("after (populate)")
The results are presented in the following tables and the attached charts. Conditions, steps, and other details will be shown
later. Note that, unlike the measurement of non-volatile WAL buffer I sent recently [1], I used an NVMe SSD for pg_wal to evaluate
this patchset with traditional mmap-ed files, that is, with no direct access (DAX) support and with page caches involved.
Results (s=50)
==============
Throughput [10^3 TPS]
( c, j) before after after
(no populate) (populate)
------- -------------------------------------
( 8, 8) 30.9 28.1 (- 9.2%) 28.3 (- 8.6%)
(18,18) 61.5 46.1 (-25.0%) 47.7 (-22.3%)
(36,18) 67.0 45.9 (-31.5%) 48.4 (-27.8%)
(54,18) 68.3 47.0 (-31.3%) 49.6 (-27.5%)
Average Latency [ms]
( c, j) before after after
(no populate) (populate)
------- --------------------------------------
( 8, 8) 0.259 0.285 (+10.0%) 0.283 (+ 9.3%)
(18,18) 0.293 0.391 (+33.4%) 0.377 (+28.7%)
(36,18) 0.537 0.784 (+46.0%) 0.744 (+38.5%)
(54,18) 0.790 1.149 (+45.4%) 1.090 (+38.0%)
Results (s=1000)
================
Throughput [10^3 TPS]
( c, j) before after after
(no populate) (populate)
------- ------------------------------------
( 8, 8) 32.0 29.6 (- 7.6%) 29.1 (- 9.0%)
(18,18) 66.1 49.2 (-25.6%) 50.4 (-23.7%)
(36,18) 76.4 51.0 (-33.3%) 53.4 (-30.1%)
(54,18) 80.1 54.3 (-32.2%) 57.2 (-28.6%)
Average Latency [ms]
( c, j) before after after
(no populate) (populate)
------- --------------------------------------
( 8, 8) 0.250 0.271 (+ 8.4%) 0.275 (+10.0%)
(18,18) 0.272 0.366 (+34.6%) 0.357 (+31.3%)
(36,18) 0.471 0.706 (+49.9%) 0.674 (+43.1%)
(54,18) 0.674 0.995 (+47.6%) 0.944 (+40.1%)
I'd say MAP_POPULATE made performance a little better in large #clients cases, comparing "populate" with "no populate". However,
comparing "after" with "before", I found both throughput and average latency degraded. VTune told me that "after (populate)" still
spent more CPU time memcpy-ing WAL records into mmap-ed segments than "before" did.
I also made a microbenchmark to see the behavior of mmap and msync. I found that:
- A major fault occurred at mmap with MAP_POPULATE, instead of at first access to the mmap-ed space.
- Some minor faults also occurred at mmap with MAP_POPULATE, and no additional fault occurred when I loaded from the mmap-ed space.
But once I stored to that space, a minor fault occurred.
- When I stored to the page that had been msync-ed, a minor fault occurred.
So I think one of the remaining causes of the performance degradation is the minor faults taken when mmap-ed pages
get dirtied, and that does not seem to be solvable by MAP_POPULATE alone, as far as I can see.
Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use two NVMe SSDs; one for PGDATA, another for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- Use the attached postgresql.conf
Steps
=====
For each (c,j) pair, I performed the following steps three times and took the median of the three runs as the final result shown in the
tables above.
(1) Run initdb with proper -D and -X options
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes
pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
I gave no -b option, so the built-in "TPC-B (sort-of)" script was used.
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)
Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA x2
Best regards,
Takashi
[1]: /messages/by-id/002701d5fd03$6e1d97a0$4a58c6e0$@hco.ntt.co.jp_1
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Thursday, February 20, 2020 2:04 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;
pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Hi,
On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:
I applied my patchset that mmap()-s WAL segments as WAL buffers to
refs/tags/REL_12_0, and measured and analyzed its performance with
pgbench. Roughly speaking, When I used *SSD and ext4* to store WAL,
it was "obviously worse" than the original REL_12_0. VTune told me
that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
larger than before.

FWIW, this might largely be because of page faults. In contrast to before, we wouldn't reuse the same pages
(because they've been munmap()/mmap()ed), so the first time they're touched, we'll incur page faults. Did you
try mmap()ing with MAP_POPULATE? It's probably also worthwhile to try to use MAP_HUGETLB.

Still doubtful it's the right direction, but I'd rather have good numbers to back me up :)
Greetings,
Andres Freund
Attachments:
v2-0001-Preallocate-more-WAL-segments.patch (application/octet-stream)
From 1afcff4eacdcb8c7d9c5547432d546d16ebef3a2 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:13:59 +0900
Subject: [PATCH v2 1/5] Preallocate more WAL segments
---
src/backend/access/transam/xlog.c | 27 ++++++++++-----------------
1 file changed, 10 insertions(+), 17 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4361568882..b0362dce44 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -894,7 +894,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
-static void PreallocXlogFiles(XLogRecPtr endptr);
+static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
static void RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -3824,27 +3824,20 @@ XLogFileClose(void)
/*
* Preallocate log files beyond the specified log endpoint.
- *
- * XXX this is currently extremely conservative, since it forces only one
- * future log segment to exist, and even that only if we are 75% done with
- * the current one. This is only appropriate for very low-WAL-volume systems.
- * High-volume systems will be OK once they've built up a sufficient set of
- * recycled log segments, but the startup transient is likely to include
- * a lot of segment creations by foreground processes, which is not so good.
*/
static void
-PreallocXlogFiles(XLogRecPtr endptr)
+PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
{
XLogSegNo _logSegNo;
+ XLogSegNo endSegNo;
+ XLogSegNo recycleSegNo;
int lf;
bool use_existent;
- uint64 offset;
- XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
- offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
- if (offset >= (uint32) (0.75 * wal_segment_size))
+ XLByteToPrevSeg(endptr, endSegNo, wal_segment_size);
+ recycleSegNo = XLOGfileslop(RedoRecPtr);
+ for (_logSegNo = endSegNo + 1; _logSegNo <= recycleSegNo; _logSegNo++)
{
- _logSegNo++;
use_existent = true;
lf = XLogFileInit(_logSegNo, &use_existent, true);
close(lf);
@@ -7748,7 +7741,7 @@ StartupXLOG(void)
/*
* Preallocate additional log files, if wanted.
*/
- PreallocXlogFiles(EndOfLog);
+ PreallocXlogFiles(RedoRecPtr, EndOfLog);
/*
* Okay, we're officially UP.
@@ -8962,7 +8955,7 @@ CreateCheckPoint(int flags)
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
- PreallocXlogFiles(recptr);
+ PreallocXlogFiles(RedoRecPtr, recptr);
/*
* Truncate pg_subtrans if possible. We can throw away all data before
@@ -9312,7 +9305,7 @@ CreateRestartPoint(int flags)
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
- PreallocXlogFiles(endptr);
+ PreallocXlogFiles(RedoRecPtr, endptr);
/*
* ThisTimeLineID is normally not set when we're still in recovery.
--
2.17.1
Attachment: v2-0002-Use-WAL-segments-as-WAL-buffers.patch (application/octet-stream)
From a228fe4588a65494b3ae2b3295461defbba55a71 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:00 +0900
Subject: [PATCH v2 2/5] Use WAL segments as WAL buffers
Please run ./configure with LIBS=-lpmem to build.
Note that we ignore wal_sync_method from here.
---
src/backend/access/transam/xlog.c | 967 +++++++++++-------------------
1 file changed, 366 insertions(+), 601 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b0362dce44..423eb839b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -18,9 +18,11 @@
#include <math.h>
#include <time.h>
#include <fcntl.h>
+#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/time.h>
#include <unistd.h>
+#include <libpmem.h>
#include "access/clog.h"
#include "access/commit_ts.h"
@@ -613,24 +615,8 @@ typedef struct XLogCtlData
XLogwrtResult LogwrtResult;
/*
- * Latest initialized page in the cache (last byte position + 1).
- *
- * To change the identity of a buffer (and InitializedUpTo), you need to
- * hold WALBufMappingLock. To change the identity of a buffer that's
- * still dirty, the old page needs to be written out first, and for that
- * you need WALWriteLock, and you need to ensure that there are no
- * in-progress insertions to the page by calling
- * WaitXLogInsertionsToFinish().
+ * This value does not change after startup.
*/
- XLogRecPtr InitializedUpTo;
-
- /*
- * These values do not change after startup, although the pointed-to pages
- * and xlblocks values certainly do. xlblocks values are protected by
- * WALBufMappingLock.
- */
- char *pages; /* buffers for unwritten XLOG pages */
- XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
int XLogCacheBlck; /* highest allocated xlog buffer index */
/*
@@ -776,9 +762,26 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
* openLogSegNo identifies the segment. These variables are only used to
* write the XLOG, and so will normally refer to the active segment.
* Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
+ *
+ * mappedPages is mmap(2)-ed address for an open log file segment.
+ * It is used as WAL buffer instead of XLogCtl->pages.
+ *
+ * pmemMapped is true if mappedPages is on PMEM.
*/
static int openLogFile = -1;
static XLogSegNo openLogSegNo = 0;
+static char *mappedPages = NULL;
static bool pmemMapped = false;
+
+/* 2MiB hugepage mask used by XLogFileMapHint */
+#define PG_HUGEPAGE_MASK ((((uintptr_t) 1) << 21) - 1)
+
+#ifndef MAP_SHARED_VALIDATE
+#define MAP_SHARED_VALIDATE 0x3
+#endif
+#ifndef MAP_SYNC
+#define MAP_SYNC 0x80000
+#endif
/*
* These variables are used similarly to the ones above, but for reading
@@ -879,12 +882,15 @@ static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
-static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
static bool XLogCheckpointNeeded(XLogSegNo new_segno);
static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
bool find_free, XLogSegNo max_segno,
bool use_lock);
+static void *XLogFileMapHint(void);
+static void *XLogFileMapUtil(void *hint, int fd, bool dax);
+static char *XLogFileMap(XLogSegNo segno, bool *is_pmem);
+static void XLogFileUnmap(char *pages, XLogSegNo segno);
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
int source, bool notfoundOk);
static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -944,7 +950,6 @@ static void checkXLogConsistency(XLogReaderState *record);
static void WALInsertLockAcquire(void);
static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
-static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
/*
* Insert an XLOG record represented by an already-constructed chain of data
@@ -1579,27 +1584,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
*/
while (CurrPos < EndPos)
{
- /*
- * The minimal action to flush the page would be to call
- * WALInsertLockUpdateInsertingAt(CurrPos) followed by
- * AdvanceXLInsertBuffer(...). The page would be left initialized
- * mostly to zeros, except for the page header (always the short
- * variant, as this is never a segment's first page).
- *
- * The large vistas of zeros are good for compressibility, but the
- * headers interrupting them every XLOG_BLCKSZ (with values that
- * differ from page to page) are not. The effect varies with
- * compression tool, but bzip2 for instance compresses about an
- * order of magnitude worse if those headers are left in place.
- *
- * Rather than complicating AdvanceXLInsertBuffer itself (which is
- * called in heavily-loaded circumstances as well as this lightly-
- * loaded one) with variant behavior, we just use GetXLogBuffer
- * (which itself calls the two methods we need) to get the pointer
- * and zero most of the page. Then we just zero the page header.
- */
- currpos = GetXLogBuffer(CurrPos);
- MemSet(currpos, 0, SizeOfXLogShortPHD);
+ /* XXX We assume that XLogFileInit does what we did here */
CurrPos += XLOG_BLCKSZ;
}
@@ -1713,29 +1698,6 @@ WALInsertLockRelease(void)
}
}
-/*
- * Update our insertingAt value, to let others know that we've finished
- * inserting up to that point.
- */
-static void
-WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
-{
- if (holdingAllLocks)
- {
- /*
- * We use the last lock to mark our actual position, see comments in
- * WALInsertLockAcquireExclusive.
- */
- LWLockUpdateVar(&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.lock,
- &WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.insertingAt,
- insertingAt);
- }
- else
- LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
- &WALInsertLocks[MyLockNo].l.insertingAt,
- insertingAt);
-}
-
/*
* Wait for any WAL insertions < upto to finish.
*
@@ -1836,123 +1798,37 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
/*
* Get a pointer to the right location in the WAL buffer containing the
* given XLogRecPtr.
- *
- * If the page is not initialized yet, it is initialized. That might require
- * evicting an old dirty buffer from the buffer cache, which means I/O.
- *
- * The caller must ensure that the page containing the requested location
- * isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto a WAL insertion lock with the insertingAt position set to
- * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
- * to evict an old page from the buffer. (This means that once you call
- * GetXLogBuffer() with a given 'ptr', you must not access anything before
- * that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
- * later, because older buffers might be recycled already)
*/
static char *
GetXLogBuffer(XLogRecPtr ptr)
{
- int idx;
- XLogRecPtr endptr;
- static uint64 cachedPage = 0;
- static char *cachedPos = NULL;
- XLogRecPtr expectedEndPtr;
+ int idx;
+ XLogPageHeader page;
+ XLogSegNo segno;
- /*
- * Fast path for the common case that we need to access again the same
- * page as last time.
- */
- if (ptr / XLOG_BLCKSZ == cachedPage)
+ /* shut-up compiler if not --enable-cassert */
+ (void) page;
+
+ XLByteToSeg(ptr, segno, wal_segment_size);
+ if (segno != openLogSegNo)
{
- Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
- Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
- return cachedPos + ptr % XLOG_BLCKSZ;
+ /* Unmap the current segment if mapped */
+ if (mappedPages != NULL)
+ XLogFileUnmap(mappedPages, openLogSegNo);
+
+ /* Map the segment we need */
+ mappedPages = XLogFileMap(segno, &pmemMapped);
+ Assert(mappedPages != NULL);
+ openLogSegNo = segno;
}
- /*
- * The XLog buffer cache is organized so that a page is always loaded to a
- * particular buffer. That way we can easily calculate the buffer a given
- * page must be loaded into, from the XLogRecPtr alone.
- */
idx = XLogRecPtrToBufIdx(ptr);
+ page = (XLogPageHeader) (mappedPages + idx * (Size) XLOG_BLCKSZ);
- /*
- * See what page is loaded in the buffer at the moment. It could be the
- * page we're looking for, or something older. It can't be anything newer
- * - that would imply the page we're looking for has already been written
- * out to disk and evicted, and the caller is responsible for making sure
- * that doesn't happen.
- *
- * However, we don't hold a lock while we read the value. If someone has
- * just initialized the page, it's possible that we get a "torn read" of
- * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
- * that case we will see a bogus value. That's ok, we'll grab the mapping
- * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
- * the page we're looking for. But it means that when we do this unlocked
- * read, we might see a value that appears to be ahead of the page we're
- * looking for. Don't PANIC on that, until we've verified the value while
- * holding the lock.
- */
- expectedEndPtr = ptr;
- expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+ Assert(page->xlp_magic == XLOG_PAGE_MAGIC);
+ Assert(page->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
- endptr = XLogCtl->xlblocks[idx];
- if (expectedEndPtr != endptr)
- {
- XLogRecPtr initializedUpto;
-
- /*
- * Before calling AdvanceXLInsertBuffer(), which can block, let others
- * know how far we're finished with inserting the record.
- *
- * NB: If 'ptr' points to just after the page header, advertise a
- * position at the beginning of the page rather than 'ptr' itself. If
- * there are no other insertions running, someone might try to flush
- * up to our advertised location. If we advertised a position after
- * the page header, someone might try to flush the page header, even
- * though page might actually not be initialized yet. As the first
- * inserter on the page, we are effectively responsible for making
- * sure that it's initialized, before we let insertingAt to move past
- * the page header.
- */
- if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
- XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
- initializedUpto = ptr - SizeOfXLogShortPHD;
- else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
- XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
- initializedUpto = ptr - SizeOfXLogLongPHD;
- else
- initializedUpto = ptr;
-
- WALInsertLockUpdateInsertingAt(initializedUpto);
-
- AdvanceXLInsertBuffer(ptr, false);
- endptr = XLogCtl->xlblocks[idx];
-
- if (expectedEndPtr != endptr)
- elog(PANIC, "could not find WAL buffer for %X/%X",
- (uint32) (ptr >> 32), (uint32) ptr);
- }
- else
- {
- /*
- * Make sure the initialization of the page is visible to us, and
- * won't arrive later to overwrite the WAL data we write on the page.
- */
- pg_memory_barrier();
- }
-
- /*
- * Found the buffer holding this page. Return a pointer to the right
- * offset within the page.
- */
- cachedPage = ptr / XLOG_BLCKSZ;
- cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
-
- Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
- Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-
- return cachedPos + ptr % XLOG_BLCKSZ;
+ return mappedPages + ptr % wal_segment_size;
}
/*
@@ -2080,178 +1956,6 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
return result;
}
-/*
- * Initialize XLOG buffers, writing out old buffers if they still contain
- * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
- * true, initialize as many pages as we can without having to write out
- * unwritten data. Any new pages are initialized to zeros, with pages headers
- * initialized properly.
- */
-static void
-AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
-{
- XLogCtlInsert *Insert = &XLogCtl->Insert;
- int nextidx;
- XLogRecPtr OldPageRqstPtr;
- XLogwrtRqst WriteRqst;
- XLogRecPtr NewPageEndPtr = InvalidXLogRecPtr;
- XLogRecPtr NewPageBeginPtr;
- XLogPageHeader NewPage;
- int npages = 0;
-
- LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
- /*
- * Now that we have the lock, check if someone initialized the page
- * already.
- */
- while (upto >= XLogCtl->InitializedUpTo || opportunistic)
- {
- nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
-
- /*
- * Get ending-offset of the buffer page we need to replace (this may
- * be zero if the buffer hasn't been used yet). Fall through if it's
- * already written out.
- */
- OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
- if (LogwrtResult.Write < OldPageRqstPtr)
- {
- /*
- * Nope, got work to do. If we just want to pre-initialize as much
- * as we can without flushing, give up now.
- */
- if (opportunistic)
- break;
-
- /* Before waiting, get info_lck and update LogwrtResult */
- SpinLockAcquire(&XLogCtl->info_lck);
- if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
- XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
- LogwrtResult = XLogCtl->LogwrtResult;
- SpinLockRelease(&XLogCtl->info_lck);
-
- /*
- * Now that we have an up-to-date LogwrtResult value, see if we
- * still need to write it or if someone else already did.
- */
- if (LogwrtResult.Write < OldPageRqstPtr)
- {
- /*
- * Must acquire write lock. Release WALBufMappingLock first,
- * to make sure that all insertions that we need to wait for
- * can finish (up to this same position). Otherwise we risk
- * deadlock.
- */
- LWLockRelease(WALBufMappingLock);
-
- WaitXLogInsertionsToFinish(OldPageRqstPtr);
-
- LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
- LogwrtResult = XLogCtl->LogwrtResult;
- if (LogwrtResult.Write >= OldPageRqstPtr)
- {
- /* OK, someone wrote it already */
- LWLockRelease(WALWriteLock);
- }
- else
- {
- /* Have to write it ourselves */
- TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
- WriteRqst.Write = OldPageRqstPtr;
- WriteRqst.Flush = 0;
- XLogWrite(WriteRqst, false);
- LWLockRelease(WALWriteLock);
- TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
- }
- /* Re-acquire WALBufMappingLock and retry */
- LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
- continue;
- }
- }
-
- /*
- * Now the next buffer slot is free and we can set it up to be the
- * next output page.
- */
- NewPageBeginPtr = XLogCtl->InitializedUpTo;
- NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-
- Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
-
- NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
-
- /*
- * Be sure to re-zero the buffer so that bytes beyond what we've
- * written will look like zeroes and not valid XLOG records...
- */
- MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
-
- /*
- * Fill the new page's header
- */
- NewPage->xlp_magic = XLOG_PAGE_MAGIC;
-
- /* NewPage->xlp_info = 0; */ /* done by memset */
- NewPage->xlp_tli = ThisTimeLineID;
- NewPage->xlp_pageaddr = NewPageBeginPtr;
-
- /* NewPage->xlp_rem_len = 0; */ /* done by memset */
-
- /*
- * If online backup is not in progress, mark the header to indicate
- * that WAL records beginning in this page have removable backup
- * blocks. This allows the WAL archiver to know whether it is safe to
- * compress archived WAL data by transforming full-block records into
- * the non-full-block format. It is sufficient to record this at the
- * page level because we force a page switch (in fact a segment
- * switch) when starting a backup, so the flag will be off before any
- * records can be written during the backup. At the end of a backup,
- * the last page will be marked as all unsafe when perhaps only part
- * is unsafe, but at worst the archiver would miss the opportunity to
- * compress a few records.
- */
- if (!Insert->forcePageWrites)
- NewPage->xlp_info |= XLP_BKP_REMOVABLE;
-
- /*
- * If first page of an XLOG segment file, make it a long header.
- */
- if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
- {
- XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
-
- NewLongPage->xlp_sysid = ControlFile->system_identifier;
- NewLongPage->xlp_seg_size = wal_segment_size;
- NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
- NewPage->xlp_info |= XLP_LONG_HEADER;
- }
-
- /*
- * Make sure the initialization of the page becomes visible to others
- * before the xlblocks update. GetXLogBuffer() reads xlblocks without
- * holding a lock.
- */
- pg_write_barrier();
-
- *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
- XLogCtl->InitializedUpTo = NewPageEndPtr;
-
- npages++;
- }
- LWLockRelease(WALBufMappingLock);
-
-#ifdef WAL_DEBUG
- if (XLOG_DEBUG && npages > 0)
- {
- elog(DEBUG1, "initialized %d pages, up to %X/%X",
- npages, (uint32) (NewPageEndPtr >> 32), (uint32) NewPageEndPtr);
- }
-#endif
-}
-
/*
* Calculate CheckPointSegments based on max_wal_size_mb and
* checkpoint_completion_target.
@@ -2380,14 +2084,9 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
static void
XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
{
- bool ispartialpage;
- bool last_iteration;
bool finishing_seg;
- bool use_existent;
- int curridx;
- int npages;
- int startidx;
- uint32 startoffset;
+ XLogSegNo rqstLogSegNo;
+ XLogSegNo segno;
/* We should always be inside a critical section here */
Assert(CritSectionCount > 0);
@@ -2397,233 +2096,149 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
*/
LogwrtResult = XLogCtl->LogwrtResult;
- /*
- * Since successive pages in the xlog cache are consecutively allocated,
- * we can usually gather multiple pages together and issue just one
- * write() call. npages is the number of pages we have determined can be
- * written together; startidx is the cache block index of the first one,
- * and startoffset is the file offset at which it should go. The latter
- * two variables are only valid when npages > 0, but we must initialize
- * all of them to keep the compiler quiet.
- */
- npages = 0;
- startidx = 0;
- startoffset = 0;
+ /* Fast return if not requested to flush */
+ if (WriteRqst.Flush == 0)
+ return;
+ Assert(WriteRqst.Flush == WriteRqst.Write);
/*
- * Within the loop, curridx is the cache block index of the page to
- * consider writing. Begin at the buffer containing the next unwritten
- * page, or last partially written page.
+ * Call pmem_persist() or pmem_msync() for each segment file that contains
+ * records to be flushed.
*/
- curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);
-
- while (LogwrtResult.Write < WriteRqst.Write)
+ XLByteToPrevSeg(WriteRqst.Flush, rqstLogSegNo, wal_segment_size);
+ XLByteToSeg(LogwrtResult.Flush, segno, wal_segment_size);
+ while (segno <= rqstLogSegNo)
{
- /*
- * Make sure we're not ahead of the insert process. This could happen
- * if we're passed a bogus WriteRqst.Write that is past the end of the
- * last page that's been initialized by AdvanceXLInsertBuffer.
- */
- XLogRecPtr EndPtr = XLogCtl->xlblocks[curridx];
+ bool is_pmem;
+ char *addr;
+ char *p;
+ Size len;
+ XLogRecPtr BeginPtr;
+ XLogRecPtr EndPtr;
- if (LogwrtResult.Write >= EndPtr)
- elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
- (uint32) (LogwrtResult.Write >> 32),
- (uint32) LogwrtResult.Write,
- (uint32) (EndPtr >> 32), (uint32) EndPtr);
-
- /* Advance LogwrtResult.Write to end of current buffer page */
- LogwrtResult.Write = EndPtr;
- ispartialpage = WriteRqst.Write < LogwrtResult.Write;
-
- if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size))
+ /* Check if the segment is not mapped yet */
+ if (segno != openLogSegNo)
{
+ /* Map newly */
+ is_pmem = 0;
+ addr = XLogFileMap(segno, &is_pmem);
+
/*
- * Switch to new logfile segment. We cannot have any pending
- * pages here (since we dump what we have at segment end).
+ * Use the segment mapped above as this process's WAL buffer from
+ * now on. Note that it might be unmapped within this loop.
*/
- Assert(npages == 0);
- if (openLogFile >= 0)
- XLogFileClose();
- XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size);
-
- /* create/use new log file */
- use_existent = true;
- openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
- ReserveExternalFD();
+ if (openLogSegNo == 0)
+ {
+ pmemMapped = is_pmem;
+ mappedPages = addr;
+ openLogSegNo = segno;
+ }
}
-
- /* Make sure we have the current logfile open */
- if (openLogFile < 0)
+ else
{
- XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size);
- openLogFile = XLogFileOpen(openLogSegNo);
- ReserveExternalFD();
+ /* Or use the existing mapping */
+ is_pmem = pmemMapped;
+ addr = mappedPages;
}
+ Assert(addr != NULL);
+ Assert(mappedPages != NULL);
+ Assert(openLogSegNo > 0);
- /* Add current page to the set of pending pages-to-dump */
- if (npages == 0)
- {
- /* first of group */
- startidx = curridx;
- startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
- wal_segment_size);
- }
- npages++;
+ /* Find beginning position to be flushed */
+ BeginPtr = segno * wal_segment_size;
+ if (BeginPtr < LogwrtResult.Flush)
+ BeginPtr = LogwrtResult.Flush;
+
+ /* Find ending position to be flushed */
+ EndPtr = (segno + 1) * wal_segment_size;
+ if (EndPtr > WriteRqst.Flush)
+ EndPtr = WriteRqst.Flush;
+
+ /* Convert LSN to memory address */
+ Assert(BeginPtr <= EndPtr);
+ p = addr + BeginPtr % wal_segment_size;
+ len = (Size) (EndPtr - BeginPtr);
/*
- * Dump the set if this will be the last loop iteration, or if we are
- * at the last page of the cache area (since the next page won't be
- * contiguous in memory), or if we are at the end of the logfile
- * segment.
+ * Do cache-flush or msync.
+ *
+ * Note that pmem_msync() backs off to the page boundary.
*/
- last_iteration = WriteRqst.Write <= LogwrtResult.Write;
-
- finishing_seg = !ispartialpage &&
- (startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;
-
- if (last_iteration ||
- curridx == XLogCtl->XLogCacheBlck ||
- finishing_seg)
+ if (is_pmem)
{
- char *from;
- Size nbytes;
- Size nleft;
- int written;
-
- /* OK to write the page(s) */
- from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
- nbytes = npages * (Size) XLOG_BLCKSZ;
- nleft = nbytes;
- do
+ pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+ pmem_persist(p, len);
+ pgstat_report_wait_end();
+ }
+ else
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+ if (pmem_msync(p, len))
{
- errno = 0;
- pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
- written = pg_pwrite(openLogFile, from, nleft, startoffset);
+ char xlogfname[MAXFNAMELEN];
+ int save_errno;
+
pgstat_report_wait_end();
- if (written <= 0)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno;
- if (errno == EINTR)
- continue;
+ save_errno = errno;
+ XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+ wal_segment_size);
+ errno = save_errno;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not msync to log file %s "
+ "at address %p, length %zu: %m",
+ xlogfname, p, len)));
+ }
+ pgstat_report_wait_end();
+ }
+ LogwrtResult.Flush = LogwrtResult.Write = EndPtr;
- save_errno = errno;
- XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
- wal_segment_size);
- errno = save_errno;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not write to log file %s "
- "at offset %u, length %zu: %m",
- xlogfname, startoffset, nleft)));
- }
- nleft -= written;
- from += written;
- startoffset += written;
- } while (nleft > 0);
+ /* Check if all of our WAL buffers are synchronized to the segment */
+ finishing_seg = (LogwrtResult.Flush % wal_segment_size == 0) &&
+ XLByteInPrevSeg(LogwrtResult.Flush, openLogSegNo,
+ wal_segment_size);
- npages = 0;
-
- /*
- * If we just wrote the whole last page of a logfile segment,
- * fsync the segment immediately. This avoids having to go back
- * and re-open prior segments when an fsync request comes along
- * later. Doing it here ensures that one and only one backend will
- * perform this fsync.
- *
- * This is also the right place to notify the Archiver that the
- * segment is ready to copy to archival storage, and to update the
- * timer for archive_timeout, and to signal for a checkpoint if
- * too many logfile segments have been used since the last
- * checkpoint.
- */
+ if (segno != openLogSegNo || finishing_seg)
+ {
+ XLogFileUnmap(addr, segno);
if (finishing_seg)
{
- issue_xlog_fsync(openLogFile, openLogSegNo);
-
- /* signal that we need to wakeup walsenders later */
- WalSndWakeupRequest();
-
- LogwrtResult.Flush = LogwrtResult.Write; /* end of page */
-
- if (XLogArchivingActive())
- XLogArchiveNotifySeg(openLogSegNo);
-
- XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
- XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
-
- /*
- * Request a checkpoint if we've consumed too much xlog since
- * the last one. For speed, we first check using the local
- * copy of RedoRecPtr, which might be out of date; if it looks
- * like a checkpoint is needed, forcibly update RedoRecPtr and
- * recheck.
- */
- if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
- {
- (void) GetRedoRecPtr();
- if (XLogCheckpointNeeded(openLogSegNo))
- RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
- }
+ Assert(segno == openLogSegNo);
+ mappedPages = NULL;
+ openLogSegNo = 0;
}
- }
- if (ispartialpage)
- {
- /* Only asked to write a partial page */
- LogwrtResult.Write = WriteRqst.Write;
- break;
- }
- curridx = NextBufIdx(curridx);
+ /* signal that we need to wakeup walsenders later */
+ WalSndWakeupRequest();
- /* If flexible, break out of loop as soon as we wrote something */
- if (flexible && npages == 0)
- break;
- }
+ if (XLogArchivingActive())
+ XLogArchiveNotifySeg(segno);
- Assert(npages == 0);
+ XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+ XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
- /*
- * If asked to flush, do so
- */
- if (LogwrtResult.Flush < WriteRqst.Flush &&
- LogwrtResult.Flush < LogwrtResult.Write)
-
- {
- /*
- * Could get here without iterating above loop, in which case we might
- * have no open file or the wrong one. However, we do not need to
- * fsync more than one file.
- */
- if (sync_method != SYNC_METHOD_OPEN &&
- sync_method != SYNC_METHOD_OPEN_DSYNC)
- {
- if (openLogFile >= 0 &&
- !XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size))
- XLogFileClose();
- if (openLogFile < 0)
+ /*
+ * Request a checkpoint if we've consumed too much xlog since
+ * the last one. For speed, we first check using the local
+ * copy of RedoRecPtr, which might be out of date; if it looks
+ * like a checkpoint is needed, forcibly update RedoRecPtr and
+ * recheck.
+ */
+ if (IsUnderPostmaster && XLogCheckpointNeeded(segno))
{
- XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
- wal_segment_size);
- openLogFile = XLogFileOpen(openLogSegNo);
- ReserveExternalFD();
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(segno))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
}
-
- issue_xlog_fsync(openLogFile, openLogSegNo);
}
- /* signal that we need to wakeup walsenders later */
- WalSndWakeupRequest();
-
- LogwrtResult.Flush = LogwrtResult.Write;
+ ++segno;
}
+ /* signal that we need to wakeup walsenders later */
+ WalSndWakeupRequest();
+
/*
* Update shared-memory status
*
@@ -3044,6 +2659,16 @@ XLogBackgroundFlush(void)
XLogFileClose();
}
}
+ else if (mappedPages != NULL)
+ {
+ if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
+ wal_segment_size))
+ {
+ XLogFileUnmap(mappedPages, openLogSegNo);
+ mappedPages = NULL;
+ openLogSegNo = 0;
+ }
+ }
return false;
}
@@ -3110,12 +2735,6 @@ XLogBackgroundFlush(void)
/* wake up walsenders now that we've released heavily contended locks */
WalSndWakeupProcessRequests();
- /*
- * Great, done. To take some work off the critical path, try to initialize
- * as many of the no-longer-needed WAL buffers for future use as we can.
- */
- AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
-
/*
* If we determined that we need to write data, but somebody else
* wrote/flushed already, it should be considered as being active, to
@@ -3269,9 +2888,26 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
memset(zbuffer.data, 0, XLOG_BLCKSZ);
pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
- save_errno = 0;
- if (wal_init_zero)
+
+ /*
+ * Allocate the file with posix_fallocate(3) to utilize hugepages and
+ * reduce the overhead of page faults. Note that posix_fallocate(3) does
+ * not set errno on error. Instead, it returns an error number directly.
+ */
+ save_errno = posix_fallocate(fd, 0, wal_segment_size);
+
+ if (save_errno)
{
+ /*
+ * Do nothing on error. Go to pgstat_report_wait_end().
+ */
+ }
+ else if (wal_init_zero)
+ {
+ XLogCtlInsert *Insert = &XLogCtl->Insert;
+ XLogPageHeader NewPage = (XLogPageHeader) zbuffer.data;
+ XLogRecPtr NewPageBeginPtr = logsegno * wal_segment_size;
+
/*
* Zero-fill the file. With this setting, we do this the hard way to
* ensure that all the file space has really been allocated. On
@@ -3283,6 +2919,48 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
*/
for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
{
+ memset(NewPage, 0, SizeOfXLogLongPHD);
+
+ /*
+ * Fill the new page's header
+ */
+ NewPage->xlp_magic = XLOG_PAGE_MAGIC;
+
+ /* NewPage->xlp_info = 0; */ /* done by memset */
+ NewPage->xlp_tli = ThisTimeLineID;
+ NewPage->xlp_pageaddr = NewPageBeginPtr;
+
+ /* NewPage->xlp_rem_len = 0; */ /* done by memset */
+
+ /*
+ * If online backup is not in progress, mark the header to indicate
+ * that WAL records beginning in this page have removable backup
+ * blocks. This allows the WAL archiver to know whether it is safe to
+ * compress archived WAL data by transforming full-block records into
+ * the non-full-block format. It is sufficient to record this at the
+ * page level because we force a page switch (in fact a segment
+ * switch) when starting a backup, so the flag will be off before any
+ * records can be written during the backup. At the end of a backup,
+ * the last page will be marked as all unsafe when perhaps only part
+ * is unsafe, but at worst the archiver would miss the opportunity to
+ * compress a few records.
+ */
+ if (!Insert->forcePageWrites)
+ NewPage->xlp_info |= XLP_BKP_REMOVABLE;
+
+ /*
+ * If first page of an XLOG segment file, make it a long header.
+ */
+ if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+ {
+ XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
+
+ NewLongPage->xlp_sysid = ControlFile->system_identifier;
+ NewLongPage->xlp_seg_size = wal_segment_size;
+ NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
+ NewPage->xlp_info |= XLP_LONG_HEADER;
+ }
+
errno = 0;
if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
{
@@ -3290,6 +2968,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
save_errno = errno ? errno : ENOSPC;
break;
}
+
+ NewPageBeginPtr += XLOG_BLCKSZ;
}
}
else
@@ -3605,6 +3285,138 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
return true;
}
+/*
+ * Get a hint address for hugepage boundary mapping.
+ *
+ * Returns non-NULL on success, or PANICs otherwise.
+ */
+static void *
+XLogFileMapHint(void)
+{
+ void *hint;
+ Size len;
+
+ len = (Size) wal_segment_size + PG_HUGEPAGE_MASK + 1;
+ hint = mmap(NULL, len, PROT_READ | PROT_WRITE,
+ MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+ if (hint == MAP_FAILED)
+ elog(PANIC, "could not get hint address");
+
+ if (munmap(hint, len) != 0)
+ elog(PANIC, "could not unmap hint address");
+
+ /* Round up to the nearest hugepage boundary */
+ return (void *) (((uintptr_t) hint + PG_HUGEPAGE_MASK) & ~PG_HUGEPAGE_MASK);
+}
+
+static void *
+XLogFileMapUtil(void *hint, int fd, bool dax)
+{
+ int flags;
+
+ if (dax)
+ flags = MAP_SHARED_VALIDATE | MAP_SYNC;
+ else
+ flags = MAP_SHARED;
+
+ return mmap(hint, wal_segment_size, PROT_READ | PROT_WRITE, flags, fd, 0);
+}
+
+/*
+ * Memory-map a pre-existing logfile segment for WAL buffers.
+ *
+ * On success, it returns non-NULL and sets *is_pmem according to
+ * whether the file is on PMEM or not. Otherwise, it PANICs.
+ */
+static char *
+XLogFileMap(XLogSegNo segno, bool *is_pmem)
+{
+ char path[MAXPGPATH];
+ char *addr;
+ void *hint;
+ int fd;
+ struct stat stat_buf;
+
+ XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
+
+ fd = BasicOpenFile(path, O_RDWR | PG_BINARY);
+ if (fd < 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not open file \"%s\": %m", path)));
+
+ if (fstat(fd, &stat_buf) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fstat file \"%s\": %m", path)));
+
+ if (stat_buf.st_size != wal_segment_size)
+ elog(PANIC,
+ "invalid logfile segment size; path \"%s\" actual %d expected %d",
+ path, (int) stat_buf.st_size, wal_segment_size);
+
+ hint = XLogFileMapHint();
+
+ /*
+ * Try DAX mapping first (dax=true).
+ *
+ * If not supported, then do regular mapping (dax=false).
+ */
+ addr = XLogFileMapUtil(hint, fd, true);
+
+ if (addr != MAP_FAILED)
+ {
+ *is_pmem = true;
+ }
+ else if (errno == EOPNOTSUPP || errno == EINVAL)
+ {
+ addr = XLogFileMapUtil(hint, fd, false);
+
+ if (addr == MAP_FAILED)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not mmap file \"%s\": %m", path)));
+
+ *is_pmem = false;
+ }
+
+ /* Check if the logfile segment is mapped onto hugepage boundary */
+ if ((uintptr_t) addr & PG_HUGEPAGE_MASK)
+ elog(WARNING,
+ "logfile segment is not mapped onto hugepage boundary; path \"%s\" actual %p expected %p",
+ path, addr, hint);
+
+ /* We don't need the file descriptor anymore, so close it */
+ if (close(fd) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m", path)));
+
+ return addr;
+}
+
+/*
+ * Unmap a given logfile segment for WAL buffer.
+ */
+static void
+XLogFileUnmap(char *pages, XLogSegNo segno)
+{
+ Assert(pages != NULL);
+
+ if (munmap(pages, wal_segment_size) != 0)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+ errno = save_errno;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not unmap file \"%s\": %m", xlogfname)));
+ }
+}
+
/*
* Open a pre-existing logfile segment for writing.
*/
@@ -4988,12 +4800,6 @@ XLOGShmemSize(void)
/* WAL insertion locks, plus alignment */
size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
- /* xlblocks array */
- size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
- /* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
- /* and the buffers themselves */
- size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
/*
* Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5069,10 +4875,6 @@ XLOGShmemInit(void)
* needed here.
*/
allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
- XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
- memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
- allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
-
/* WAL insertion locks. Ensure they're aligned to the full padded size */
allocptr += sizeof(WALInsertLockPadded) -
@@ -5089,15 +4891,6 @@ XLOGShmemInit(void)
WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
}
- /*
- * Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
- */
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
- XLogCtl->pages = allocptr;
- memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
-
/*
* Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
* in additional info.)
@@ -7550,40 +7343,12 @@ StartupXLOG(void)
Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
/*
- * Tricky point here: readBuf contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
+ * We DO NOT need the if-else block that once existed here, because we
+ * use WAL segment files as WAL buffers, so the last block is "already
+ * on the buffers."
+ *
+ * XXX We assume there is no torn record.
*/
- if (EndOfLog % XLOG_BLCKSZ != 0)
- {
- char *page;
- int len;
- int firstIdx;
- XLogRecPtr pageBeginPtr;
-
- pageBeginPtr = EndOfLog - (EndOfLog % XLOG_BLCKSZ);
- Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
-
- firstIdx = XLogRecPtrToBufIdx(EndOfLog);
-
- /* Copy the valid part of the last block, and zero the rest */
- page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
- len = EndOfLog % XLOG_BLCKSZ;
- memcpy(page, xlogreader->readBuf, len);
- memset(page + len, 0, XLOG_BLCKSZ - len);
-
- XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
- XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
- }
- else
- {
- /*
- * There is no partial block to copy. Just set InitializedUpTo, and
- * let the first attempt to insert a log record to initialize the next
- * buffer.
- */
- XLogCtl->InitializedUpTo = EndOfLog;
- }
LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
--
2.17.1
Attachment: v2-0003-Lazy-unmap-WAL-segments.patch (application/octet-stream)
From cf15df350201cd2c5383f04ea52b9ddc534c997a Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:02 +0900
Subject: [PATCH v2 3/5] Lazy-unmap WAL segments
---
src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 423eb839b5..ff7d0b69bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -770,7 +770,9 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
*/
static int openLogFile = -1;
static XLogSegNo openLogSegNo = 0;
+static XLogSegNo beingClosedLogSegNo = 0;
static char *mappedPages = NULL;
+static char *beingUnmappedPages = NULL;
static bool pmemMapped = 0;
/* 2MiB hugepage mask used by XLogFileMapHint */
@@ -1179,6 +1181,14 @@ XLogInsertRecord(XLogRecData *rdata,
}
}
+ /* Lazy-unmap */
+ if (beingUnmappedPages != NULL)
+ {
+ XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+ beingUnmappedPages = NULL;
+ beingClosedLogSegNo = 0;
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
@@ -1812,9 +1822,23 @@ GetXLogBuffer(XLogRecPtr ptr)
XLByteToSeg(ptr, segno, wal_segment_size);
if (segno != openLogSegNo)
{
- /* Unmap the current segment if mapped */
+ /*
+ * We do not want to unmap the current segment here because we are in
+ * a critical section and unmapping is a time-consuming operation. So we
+ * just mark it to be unmapped later.
+ */
if (mappedPages != NULL)
- XLogFileUnmap(mappedPages, openLogSegNo);
+ {
+ /*
+ * If there is another being-unmapped segment, it cannot be helped;
+ * we unmap it here.
+ */
+ if (beingUnmappedPages != NULL)
+ XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+
+ beingUnmappedPages = mappedPages;
+ beingClosedLogSegNo = openLogSegNo;
+ }
/* Map the segment we need */
mappedPages = XLogFileMap(segno, &pmemMapped);
--
2.17.1
Attachment: v2-0004-Speculative-map-WAL-segments.patch (application/octet-stream)
From 111d5892f076cc0504e9ec2866ac5297de1862df Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:03 +0900
Subject: [PATCH v2 4/5] Speculative-map WAL segments
---
src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
1 file changed, 19 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff7d0b69bd..382256369d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -993,6 +993,8 @@ XLogInsertRecord(XLogRecData *rdata,
info == XLOG_SWITCH);
XLogRecPtr StartPos;
XLogRecPtr EndPos;
+ XLogRecPtr ProbablyInsertPos;
+ XLogSegNo ProbablyInsertSegNo;
bool prevDoPageWrites = doPageWrites;
/* we assume that all of the record header is in the first chunk */
@@ -1002,6 +1004,23 @@ XLogInsertRecord(XLogRecData *rdata,
if (!XLogInsertAllowed())
elog(ERROR, "cannot make new WAL entries during recovery");
+ /* Speculatively map a segment we probably need */
+ ProbablyInsertPos = GetInsertRecPtr();
+ XLByteToSeg(ProbablyInsertPos, ProbablyInsertSegNo, wal_segment_size);
+ if (ProbablyInsertSegNo != openLogSegNo)
+ {
+ if (mappedPages != NULL)
+ {
+ Assert(beingUnmappedPages == NULL);
+ Assert(beingClosedLogSegNo == 0);
+ beingUnmappedPages = mappedPages;
+ beingClosedLogSegNo = openLogSegNo;
+ }
+ mappedPages = XLogFileMap(ProbablyInsertSegNo, &pmemMapped);
+ Assert(mappedPages != NULL);
+ openLogSegNo = ProbablyInsertSegNo;
+ }
+
/*----------
*
* We have now done all the preparatory work we can without holding a
--
2.17.1
Attachment: v2-0005-Map-WAL-segments-with-MAP_POPULATE-if-non-DAX.patch (application/octet-stream)
From a3ba57b33ac23f8db46e7f92e72a558db6ccd64a Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:04 +0900
Subject: [PATCH v2 5/5] Map WAL segments with MAP_POPULATE if non-DAX
---
src/backend/access/transam/xlog.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 382256369d..5c387846e5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3361,7 +3361,7 @@ XLogFileMapUtil(void *hint, int fd, bool dax)
if (dax)
flags = MAP_SHARED_VALIDATE | MAP_SYNC;
else
- flags = MAP_SHARED;
+ flags = MAP_SHARED | MAP_POPULATE;
return mmap(hint, wal_segment_size, PROT_READ | PROT_WRITE, flags, fd, 0);
}
--
2.17.1
Dear hackers,
I have updated my non-volatile WAL buffer patchset to v3. Now we can use it in streaming replication mode.
Updates from v2:
- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly into the non-volatile WAL buffer if applicable.
- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote'
<amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for
each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts.
Conditions, steps, and other details will be shown later.

Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)

Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)

Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).

The percentage in the s=1000 case looks larger than in the s=50 case. I think a larger scaling factor leads to fewer contentions on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
  - Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result
shown in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench used the built-in "TPC-B (sort-of)" script.
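The conditions and steps above can be sketched as a shell session. This is a hedged reconstruction: the device name /dev/pmem0, the mount point /mnt/pmem, the data directory /ssd/pgdata, and the 80-GiB --nvwal-size are hypothetical placeholders, not values taken from the report.

```shell
# Hypothetical paths: /dev/pmem0 (interleaved NVDIMM-N namespace),
# /mnt/pmem (its mount point), /ssd/pgdata (PGDATA on the NVMe SSD).

# Both filesystems are ext4; the NVDIMM-N one is mounted with -o dax
# so that loads/stores reach persistent memory directly.
sudo mkfs.ext4 /dev/pmem0
sudo mount -o dax /dev/pmem0 /mnt/pmem

# (1) initdb; --nvwal-path and --nvwal-size exist only with the patchset.
#     The size (in MB) must be a multiple of the WAL segment size.
initdb -D /ssd/pgdata -X /mnt/pmem/pg_wal \
       --nvwal-path=/mnt/pmem/nvwal --nvwal-size=81920

# (2)-(3) start the server, create the database, build pgbench tables
pg_ctl -D /ssd/pgdata -w start
createdb -h /tmp -p 5432 bench
pgbench -h /tmp -p 5432 -i -s 1000 bench

# (4) restart with remounted filesystems, then (5) prewarm the four tables
pg_ctl -D /ssd/pgdata -w restart
psql -h /tmp -p 5432 -d bench -c "CREATE EXTENSION pg_prewarm"
for t in pgbench_accounts pgbench_branches pgbench_tellers pgbench_history; do
    psql -h /tmp -p 5432 -d bench -c "SELECT pg_prewarm('$t')"
done

# (6) run the benchmark for 30 minutes
pgbench -h /tmp -p 5432 -r -M prepared -T 1800 -c 36 -j 18 bench
```

The commands require the PMEM hardware and the patched binaries, so they are a procedure sketch rather than something to paste verbatim.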
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;'PostgreSQL-development'
<pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,
Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.
Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,
On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
Hello Amit,
I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as
REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?
Yes, because I think it's human-friendly to reproduce and discuss
performance measurement. Of course I know all new accepted patches
are merged into master's HEAD, not stable branches and not even
release tags, so I'm aware of rebasing my patchset onto master sooner
or later. However, if someone, including me, says that s/he applies
my patchset to "master" and measures its performance, we have to pay
attention to which commit the "master" really points to. Although we
have SHA-1 hashes to specify which commit, we should check whether the
specific commit on master has patches affecting performance or not,
because master's HEAD gets new patches day by day. On the other hand,
a release tag clearly points to a commit we all probably know. Also we
can more easily check the features and improvements by using release
notes and user manuals.

Thanks for clarifying. I see where you're coming from.
While I do sometimes see people reporting numbers with the latest
stable release' branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to
notice impact from any relevant developments in the master branch,
even developments which possibly require rethinking the architecture
of your own changes, although maybe that rarely occurs.
Thanks,
Amit
Attachments:
v3-0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From 931ab8fa7e9181f6b69601ad279e0ee5acb103d4 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:56 +0900
Subject: [PATCH v3 1/5] Support GUCs for external WAL buffer
To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size. Now postgres maps a file at that path onto memory to
use it as the WAL buffer. Note that the buffer is still volatile for now.
---
configure | 262 ++++++++++++++++++
configure.in | 43 +++
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/nv_xlog_buffer.c | 95 +++++++
src/backend/access/transam/xlog.c | 164 ++++++++++-
src/backend/utils/misc/guc.c | 23 +-
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 93 ++++++-
src/include/access/nv_xlog_buffer.h | 71 +++++
src/include/access/xlog.h | 2 +
src/include/pg_config.h.in | 6 +
src/include/utils/guc.h | 4 +
12 files changed, 747 insertions(+), 21 deletions(-)
create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
create mode 100644 src/include/access/nv_xlog_buffer.h
diff --git a/configure b/configure
index 2feff37fe3..3f16feeb54 100755
--- a/configure
+++ b/configure
@@ -866,6 +866,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_nvwal
with_gnu_ld
enable_largefile
'
@@ -1570,6 +1571,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-nvwal use non-volatile WAL buffer (NVWAL)
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8504,6 +8506,203 @@ fi
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+ withval=$with_nvwal;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+ freebsd1*|freebsd2*) elf=no;;
+ freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ if test -z "$GREP"; then
+ ac_path_GREP_found=false
+ # Loop through the user's path and test for each of PROGNAME-LIST
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_prog in grep ggrep; do
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+ as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+ # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+ ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+ ac_count=0
+ $as_echo_n 0123456789 >"conftest.in"
+ while :
+ do
+ cat "conftest.in" "conftest.in" >"conftest.tmp"
+ mv "conftest.tmp" "conftest.in"
+ cp "conftest.in" "conftest.nl"
+ $as_echo 'GREP' >> "conftest.nl"
+ "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+ diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+ as_fn_arith $ac_count + 1 && ac_count=$as_val
+ if test $ac_count -gt ${ac_path_GREP_max-0}; then
+ # Best one so far, save it but keep looking for a better one
+ ac_cv_path_GREP="$ac_path_GREP"
+ ac_path_GREP_max=$ac_count
+ fi
+ # 10*(2^10) chars as input seems more than enough
+ test $ac_count -gt 10 && break
+ done
+ rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+ $ac_path_GREP_found && break 3
+ done
+ done
+ done
+IFS=$as_save_IFS
+ if test -z "$ac_cv_path_GREP"; then
+ as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+ fi
+else
+ ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+ then ac_cv_path_EGREP="$GREP -E"
+ else
+ if test -z "$EGREP"; then
+ ac_path_EGREP_found=false
+ # Loop through the user's path and test for each of PROGNAME-LIST
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_prog in egrep; do
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+ as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+ # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+ ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+ ac_count=0
+ $as_echo_n 0123456789 >"conftest.in"
+ while :
+ do
+ cat "conftest.in" "conftest.in" >"conftest.tmp"
+ mv "conftest.tmp" "conftest.in"
+ cp "conftest.in" "conftest.nl"
+ $as_echo 'EGREP' >> "conftest.nl"
+ "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+ diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+ as_fn_arith $ac_count + 1 && ac_count=$as_val
+ if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+ # Best one so far, save it but keep looking for a better one
+ ac_cv_path_EGREP="$ac_path_EGREP"
+ ac_path_EGREP_max=$ac_count
+ fi
+ # 10*(2^10) chars as input seems more than enough
+ test $ac_count -gt 10 && break
+ done
+ rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+ $ac_path_EGREP_found && break 3
+ done
+ done
+ done
+IFS=$as_save_IFS
+ if test -z "$ac_cv_path_EGREP"; then
+ as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+ fi
+else
+ ac_cv_path_EGREP=$EGREP
+fi
+
+ fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#if __ELF__
+ yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+ $EGREP "yes" >/dev/null 2>&1; then :
+ ELF_SYS=true
+else
+ if test "X$elf" = "Xyes" ; then
+ ELF_SYS=true
+else
+ ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
#
# Assignments
#
@@ -12861,6 +13060,57 @@ fi
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_pmem_pmem_map_file=yes
+else
+ ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+ LIBS="-lpmem $LIBS"
+
+else
+ as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
##
## Header files
@@ -13540,6 +13790,18 @@ fi
done
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
fi
if test "$PORTNAME" = "win32" ; then
diff --git a/configure.in b/configure.in
index 0188c6ff07..a5f9c9fb9d 100644
--- a/configure.in
+++ b/configure.in
@@ -992,6 +992,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+ [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+ freebsd1*|freebsd2*) elf=no;;
+ freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+ yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+ ELF_SYS=true
+else
+ ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
#
# Assignments
#
@@ -1293,6 +1325,12 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ AC_CHECK_LIB(pmem, pmem_map_file, [],
+ [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
##
## Header files
@@ -1470,6 +1508,11 @@ elif test "$with_uuid" = ossp ; then
[AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
if test "$PORTNAME" = "win32" ; then
AC_CHECK_HEADERS(crtdefs.h)
fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
xlogfuncs.o \
xloginsert.o \
xlogreader.o \
- xlogutils.o
+ xlogutils.o \
+ nv_xlog_buffer.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ * PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns a mapped address if success; PANICs and never return otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+ void *addr;
+ size_t map_len = 0;
+ int is_pmem = 0;
+
+ Assert(fname != NULL);
+ Assert(fsize > 0);
+
+ if (IsBootstrapProcessingMode())
+ {
+ /*
+ * Create and map a new file if we are in bootstrap mode (typically
+ * executed by initdb).
+ */
+ addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+ pg_file_create_mode, &map_len, &is_pmem);
+ }
+ else
+ {
+ /*
+ * Map an existing file. The second argument (len) should be zero,
+ * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+ * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+ */
+ addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+ }
+
+ if (addr == NULL)
+ elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+ if (map_len != fsize)
+ elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+ "expected %zu; actual %zu",
+ fname, fsize, map_len);
+
+ if (!is_pmem)
+ elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+ fname);
+
+ /*
+ * Assert page boundary alignment (8KiB as default). It should pass because
+ * PMDK considers hugepage boundary alignment (2MiB or 1GiB on x64).
+ */
+ Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+ elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+ fname, addr, (char *) addr + map_len);
+ return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+ Assert(addr != NULL);
+
+ if (pmem_unmap(addr, fsize) < 0)
+ {
+ elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+ return;
+ }
+
+ elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b..0681ba1262 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -37,6 +37,7 @@
#include "access/xloginsert.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
@@ -873,6 +874,12 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
+/* For non-volatile WAL buffer (NVWAL) */
+char *NvwalPath = NULL; /* a GUC parameter */
+int NvwalSizeMB = 1024; /* a direct GUC parameter */
+static Size NvwalSize = 0; /* an indirect GUC parameter */
+static bool NvwalAvail = false;
+
/* For WALInsertLockAcquire/Release functions */
static int MyLockNo = 0;
static bool holdingAllLocks = false;
@@ -5014,6 +5021,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+ Assert(!NvwalAvail);
+
+ if (**newval != '\0')
+ {
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ GUC_check_errmsg("nvwal_path is invalid parameter without NVWAL");
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+ /* true if not empty; false if empty */
+ NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the boundary only and DOES NOT check if the size is multiple
+ * of wal_segment_size because the segment size (probably stored in the
+ * control file) has not been set properly here yet.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+ Size buf_size;
+ int64 npages;
+
+ Assert(*newval > 0);
+
+ buf_size = (Size) (*newval) * 1024 * 1024;
+ npages = (int64) buf_size / XLOG_BLCKSZ;
+ Assert(npages > 0);
+
+ if (npages > INT_MAX)
+ {
+ /* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+ "the number of WAL pages too large; "
+ "buf_size %zu; XLOG_BLCKSZ %d",
+ *newval, buf_size, (int) XLOG_BLCKSZ);
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+ NvwalSize = (Size) newval * 1024 * 1024;
+}
+
/*
* Read the control file, set respective GUCs.
*
@@ -5042,13 +5119,49 @@ XLOGShmemSize(void)
{
Size size;
+ /*
+ * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+ * Instead, we set it to a value based on the size of the file for the
+ * buffer. This should be done here because of xlblocks array calculation.
+ */
+ if (NvwalAvail)
+ {
+ char buf[32];
+ int64 npages;
+
+ Assert(NvwalSizeMB > 0);
+ Assert(NvwalSize > 0);
+ Assert(wal_segment_size > 0);
+ Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+ /*
+ * At last, we can check if the size of non-volatile WAL buffer
+ * (nvwal_size) is multiple of WAL segment size.
+ *
+ * Note that NvwalSize has already been calculated in assign_nvwal_size.
+ */
+ if (NvwalSize % wal_segment_size != 0)
+ {
+ elog(PANIC,
+ "invalid value for nvwal_size (%dMB): "
+ "it should be multiple of WAL segment size; "
+ "NvwalSize %zu; wal_segment_size %d",
+ NvwalSizeMB, NvwalSize, wal_segment_size);
+ }
+
+ npages = (int64) NvwalSize / XLOG_BLCKSZ;
+ Assert(npages > 0 && npages <= INT_MAX);
+
+ snprintf(buf, sizeof(buf), "%d", (int) npages);
+ SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+ }
/*
* If the value of wal_buffers is -1, use the preferred auto-tune value.
* This isn't an amazingly clean place to do this, but we must wait till
* NBuffers has received its final value, and must do it before using the
* value of XLOGbuffers to do anything important.
*/
- if (XLOGbuffers == -1)
+ else if (XLOGbuffers == -1)
{
char buf[32];
@@ -5064,10 +5177,13 @@ XLOGShmemSize(void)
size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
- /* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
- /* and the buffers themselves */
- size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ if (!NvwalAvail)
+ {
+ /* extra alignment padding for XLOG I/O buffers */
+ size = add_size(size, XLOG_BLCKSZ);
+ /* and the buffers themselves */
+ size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ }
/*
* Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5161,13 +5277,32 @@ XLOGShmemInit(void)
}
/*
- * Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+ * Open and memory-map a file for non-volatile XLOG buffer. The PMDK will
+ * align the start of the buffer to 2-MiB boundary if the size of the
+ * buffer is larger than or equal to 4 MiB.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
- XLogCtl->pages = allocptr;
- memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ if (NvwalAvail)
+ {
+ /* Logging and error-handling should be done in the function */
+ XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+ /*
+ * Do not memset non-volatile XLOG buffer (XLogCtl->pages) here
+ * because it would contain records for recovery. We should do so in
+ * checkpoint after the recovery completes successfully.
+ */
+ }
+ else
+ {
+ /*
+ * Align the start of the page buffers to a full xlog block size
+ * boundary. This simplifies some calculations in XLOG insertion. It
+ * is also required for O_DIRECT.
+ */
+ allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ XLogCtl->pages = allocptr;
+ memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ }
/*
* Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8522,6 +8657,13 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
+
+ /*
+ * If we use non-volatile XLOG buffer, unmap it.
+ */
+ if (NvwalAvail)
+ UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 75fc6f11d6..140a99faee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2707,7 +2707,7 @@ static struct config_int ConfigureNamesInt[] =
GUC_UNIT_XBLOCKS
},
&XLOGbuffers,
- -1, -1, (INT_MAX / XLOG_BLCKSZ),
+ -1, -1, INT_MAX,
check_wal_buffers, NULL, NULL
},
@@ -3381,6 +3381,17 @@ static struct config_int ConfigureNamesInt[] =
NULL, assign_tcp_user_timeout, show_tcp_user_timeout
},
+ {
+ {"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+ NULL,
+ GUC_UNIT_MB
+ },
+ &NvwalSizeMB,
+ 1024, 1, INT_MAX,
+ check_nvwal_size, assign_nvwal_size, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4419,6 +4430,16 @@ static struct config_string ConfigureNamesString[] =
check_backtrace_functions, assign_backtrace_functions, NULL
},
+ {
+ {"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+ NULL
+ },
+ &NvwalPath,
+ "",
+ check_nvwal_path, assign_nvwal_path, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3a25287a39..866f77828d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -226,6 +226,8 @@
#checkpoint_timeout = 5min # range 30s-1d
#max_wal_size = 1GB
#min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 786672b1b6..1b18097580 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,7 +144,10 @@ static bool show_setting = false;
static bool data_checksums = false;
static char *xlog_dir = NULL;
static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
static int wal_segment_size_mb;
+static int nvwal_size_mb;
/* internal vars */
@@ -1109,14 +1112,78 @@ setup_config(void)
conflines = replace_token(conflines, "#port = 5432", repltok);
#endif
- /* set default max_wal_size and min_wal_size */
- snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
- pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
- conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+ if (nvwal_path != NULL)
+ {
+ int nr_segs;
+
+ if (str_nvwal_size_mb == NULL)
+ nvwal_size_mb = 1024;
+ else
+ {
+ char *endptr;
+
+ /* check that the argument is a number */
+ nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+ /* verify that the size of non-volatile WAL buffer is valid */
+ if (endptr == str_nvwal_size_mb || *endptr != '\0')
+ {
+ pg_log_error("argument of --nvwal-size must be a number; "
+ "str_nvwal_size_mb '%s'",
+ str_nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb <= 0)
+ {
+ pg_log_error("argument of --nvwal-size must be a positive number; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb % wal_segment_size_mb != 0)
+ {
+ pg_log_error("argument of --nvwal-size must be multiple of WAL segment size; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+ exit(1);
+ }
+ }
+
+ /*
+ * XXX We set {min_,max_,nv}wal_size to the same value. Note that
+ * postgres might bootstrap and run if the three configs do not have
+ * the same value, but that has not been tested yet.
+ */
+ nr_segs = nvwal_size_mb / wal_segment_size_mb;
- snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
- pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
- conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+ nvwal_path);
+ conflines = replace_token(conflines,
+ "#nvwal_path = '/path/to/nvwal'", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+ }
+ else
+ {
+ /* set default max_wal_size and min_wal_size */
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ }
snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
escape_quotes(lc_messages));
@@ -2321,6 +2388,8 @@ usage(const char *progname)
printf(_(" -W, --pwprompt prompt for a password for the new superuser\n"));
printf(_(" -X, --waldir=WALDIR location for the write-ahead log directory\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
+ printf(_(" -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)\n"));
+ printf(_(" -Q, --nvwal-size=SIZE size of NVWAL, in megabytes\n"));
printf(_("\nLess commonly used options:\n"));
printf(_(" -d, --debug generate lots of debugging output\n"));
printf(_(" -k, --data-checksums use data page checksums\n"));
@@ -2989,6 +3058,8 @@ main(int argc, char *argv[])
{"sync-only", no_argument, NULL, 'S'},
{"waldir", required_argument, NULL, 'X'},
{"wal-segsize", required_argument, NULL, 12},
+ {"nvwal-path", required_argument, NULL, 'P'},
+ {"nvwal-size", required_argument, NULL, 'Q'},
{"data-checksums", no_argument, NULL, 'k'},
{"allow-group-access", no_argument, NULL, 'g'},
{NULL, 0, NULL, 0}
@@ -3032,7 +3103,7 @@ main(int argc, char *argv[])
/* process command-line options */
- while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+ while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
{
switch (c)
{
@@ -3126,6 +3197,12 @@ main(int argc, char *argv[])
case 12:
str_wal_segment_size_mb = pg_strdup(optarg);
break;
+ case 'P':
+ nvwal_path = pg_strdup(optarg);
+ break;
+ case 'Q':
+ str_nvwal_size_mb = pg_strdup(optarg);
+ break;
case 'g':
SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+#define nv_memset_persist pmem_memset_persist
+#define nv_memcpy_nodrain pmem_memcpy_nodrain
+#define nv_flush pmem_flush
+#define nv_drain pmem_drain
+#define nv_persist pmem_persist
+
+#else
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+ return NULL;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+ return;
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+ return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+ size_t len)
+{
+ return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+ return;
+}
+
+static inline void
+nv_drain(void)
+{
+ return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+ return;
+}
+
+#endif /* USE_NVWAL */
+#endif /* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..0a05e79524 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,8 @@ extern int recovery_min_apply_delay;
extern char *PrimaryConnInfo;
extern char *PrimarySlotName;
extern bool wal_receiver_create_temp_slot;
+extern char *NvwalPath;
+extern int NvwalSizeMB;
/* indirectly set via GUC system */
extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c199cd46d2..90d23b46d1 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
/* Define to 1 if you have the `pam' library (-lpam). */
#undef HAVE_LIBPAM
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
/* Define if you have a function readline library */
#undef HAVE_LIBREADLINE
@@ -880,6 +883,9 @@
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
/* Define to build with OpenSSL support. (--with-openssl) */
#undef USE_OPENSSL
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..d941a76d43 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,6 +438,10 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
#endif /* GUC_H */
--
2.17.1
Attachment: v3-0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 0cb1f9197350d76ad8ef1fc2115afb7abdfc4fdc Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:57 +0900
Subject: [PATCH v3 2/5] Non-volatile WAL buffer
The external WAL buffer now becomes non-volatile.
Bumps PG_CONTROL_VERSION.
---
src/backend/access/transam/xlog.c | 1154 ++++++++++++++++--
src/backend/access/transam/xlogreader.c | 24 +
src/bin/pg_controldata/pg_controldata.c | 3 +
src/include/access/xlog.h | 8 +
src/include/catalog/pg_control.h | 17 +-
src/test/regress/expected/misc_functions.out | 14 +-
src/test/regress/sql/misc_functions.sql | 14 +-
7 files changed, 1097 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0681ba1262..45e05b9498 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -654,6 +654,13 @@ typedef struct XLogCtlData
TimeLineID ThisTimeLineID;
TimeLineID PrevTimeLineID;
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * All the records up to this LSN are persistent in NVWAL.
+ */
+ XLogRecPtr persistentUpTo;
+
/*
* SharedRecoveryState indicates if we're still in crash or archive
* recovery. Protected by info_lck.
@@ -783,11 +790,13 @@ typedef enum
XLOG_FROM_ANY = 0, /* request to read WAL from any source */
XLOG_FROM_ARCHIVE, /* restored using restore_command */
XLOG_FROM_PG_WAL, /* existing file in pg_wal */
- XLOG_FROM_STREAM /* streamed from master */
+ XLOG_FROM_NVWAL, /* non-volatile WAL buffer */
+ XLOG_FROM_STREAM, /* streamed from master via segment file */
+ XLOG_FROM_STREAM_NVWAL /* same as above, but via NVWAL */
} XLogSource;
/* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream", "stream_nvwal"};
/*
* openLogFile is -1 or a kernel FD for an open log file segment.
@@ -922,6 +931,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1204,6 +1214,43 @@ XLogInsertRecord(XLogRecData *rdata,
}
}
+ /*
+ * Request a checkpoint here if non-volatile WAL buffer is used and we
+ * have consumed too much WAL since the last checkpoint.
+ *
+ * We first screen with condition (1) OR (2) below:
+ *
+ * (1) The record was the first one in a certain segment.
+ * (2) The record was inserted across segments.
+ *
+ * We then check the segment number which the record was inserted into.
+ */
+ if (NvwalAvail && inserted &&
+ (StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+ StartPos / wal_segment_size < EndPos / wal_segment_size))
+ {
+ XLogSegNo end_segno;
+
+ XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+ /*
+ * NOTE: We do not signal walsender here because the inserted record
+ * has not been drained to the NVWAL buffer yet.
+ *
+ * NOTE: We do not signal walarchiver here because the inserted record
+ * has not been flushed to a segment file. So we don't need to update
+ * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+ */
+
+ /* Two-step checking for speed (see also XLogWrite) */
+ if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(end_segno))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
@@ -2136,6 +2183,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
XLogRecPtr NewPageBeginPtr;
XLogPageHeader NewPage;
int npages = 0;
+ bool is_firstpage;
+
+ if (NvwalAvail)
+ elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo,
+ (uint32) (upto >> 32),
+ (uint32) upto,
+ opportunistic ? "true" : "false");
LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
@@ -2197,7 +2253,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
{
/* Have to write it ourselves */
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
- WriteRqst.Write = OldPageRqstPtr;
+
+ if (NvwalAvail)
+ {
+ /*
+ * If we use non-volatile WAL buffer, it is a special
+ * but expected case to write the buffer pages out to
+ * segment files, and for simplicity, it is done in
+ * segment by segment.
+ */
+ XLogRecPtr OldSegEndPtr;
+
+ OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+ Assert(OldSegEndPtr % wal_segment_size == 0);
+
+ WriteRqst.Write = OldSegEndPtr;
+ }
+ else
+ WriteRqst.Write = OldPageRqstPtr;
+
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, false);
LWLockRelease(WALWriteLock);
@@ -2224,7 +2298,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* Be sure to re-zero the buffer so that bytes beyond what we've
* written will look like zeroes and not valid XLOG records...
*/
- MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+ if (NvwalAvail)
+ {
+ /*
+ * We do not combine MemSet() and pmem_persist() here because
+ * pmem_persist() may use a slow, strongly-ordered cache flush
+ * instruction if a weak-ordered fast one is not supported.
+ * Instead, we first zero-fill the buffer with
+ * pmem_memset_persist(), which can leverage fast non-temporal
+ * store instructions, and make the header persistent later.
+ */
+ nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+ }
+ else
+ MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
/*
* Fill the new page's header
@@ -2256,7 +2343,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
/*
* If first page of an XLOG segment file, make it a long header.
*/
- if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+ is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+ if (is_firstpage)
{
XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
@@ -2271,7 +2359,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* before the xlblocks update. GetXLogBuffer() reads xlblocks without
* holding a lock.
*/
- pg_write_barrier();
+ if (NvwalAvail)
+ {
+ /* Make the header persistent on PMEM */
+ nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+ }
+ else
+ pg_write_barrier();
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
@@ -2281,6 +2375,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
}
LWLockRelease(WALBufMappingLock);
+ if (NvwalAvail)
+ elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo,
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo);
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG && npages > 0)
{
@@ -2662,6 +2763,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
LogwrtResult.Flush = LogwrtResult.Write;
}
+ /*
+ * Update discardedUpTo if NVWAL is used. A new value should not fall
+ * behind the old one.
+ */
+ if (NvwalAvail)
+ {
+ Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (ControlFile->discardedUpTo < LogwrtResult.Write)
+ {
+ ControlFile->discardedUpTo = LogwrtResult.Write;
+ UpdateControlFile();
+ }
+ LWLockRelease(ControlFileLock);
+ }
+
/*
* Update shared-memory status
*
@@ -2866,6 +2984,123 @@ XLogFlush(XLogRecPtr record)
return;
}
+ if (NvwalAvail)
+ {
+ XLogRecPtr FromPos;
+
+ /*
+ * No page on the NVWAL is to be flushed to segment files. Instead,
+ * we wait until all the insertions preceding this one complete. We will
+ * wait for all the records to be persistent on the NVWAL below.
+ */
+ record = WaitXLogInsertionsToFinish(record);
+
+ /*
+ * Check if another backend has already done what I am doing.
+ *
+ * We can compare something <= XLogCtl->persistentUpTo without
+ * holding XLogCtl->info_lck spinlock because persistentUpTo is
+ * monotonically increasing and can be loaded atomically on each
+ * NVWAL-supported platform (now x64 only).
+ */
+ FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+ if (record <= FromPos)
+ return;
+
+ /*
+ * In a very rare case, we have wrapped around the whole NVWAL. We
+ * do not need to care about old pages here because they have already
+ * been evicted to segment files at record insertion.
+ *
+ * In such a case, we flush the whole NVWAL. We also log it as a
+ * warning because it can be a time-consuming operation.
+ *
+ * TODO: Advance XLogCtl->persistentUpTo at the end of XLogWrite; then
+ * we can remove the following first if-block.
+ */
+ if (record - FromPos > NvwalSize)
+ {
+ elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+ (uint32) (FromPos >> 32), (uint32) FromPos,
+ (uint32) (record >> 32), (uint32) record);
+
+ nv_flush(XLogCtl->pages, NvwalSize);
+ }
+ else
+ {
+ char *frompos;
+ char *uptopos;
+ size_t fromoff;
+ size_t uptooff;
+
+ /*
+ * Flush each record that is probably not flushed yet.
+ *
+ * We say "probably" for two reasons. First, a record copied with a
+ * non-temporal store instruction has in effect already been
+ * "flushed," but we cannot distinguish such a record. nv_flush is
+ * harmless to it in terms of consistency.
+ *
+ * Second, the target record might have already been evicted to a
+ * segment file by now. In this case too, nv_flush is harmless in
+ * terms of consistency.
+ */
+ uptooff = record % NvwalSize;
+ uptopos = XLogCtl->pages + uptooff;
+ fromoff = FromPos % NvwalSize;
+ frompos = XLogCtl->pages + fromoff;
+
+ /* Handles rotation */
+ if (uptopos <= frompos)
+ {
+ nv_flush(frompos, NvwalSize - fromoff);
+ fromoff = 0;
+ frompos = XLogCtl->pages;
+ }
+
+ nv_flush(frompos, uptooff - fromoff);
+ }
+
+ /*
+ * To guarantee durability ("D" of ACID), we should satisfy the
+ * following two for each transaction X:
+ *
+ * (1) All the WAL records inserted by X, including the commit record
+ * of X, should persist on NVWAL before the server commits X.
+ *
+ * (2) All the WAL records inserted by transactions other than
+ * X, that have a smaller LSN than the commit record just inserted
+ * by X, should persist on NVWAL before the server commits X.
+ *
+ * Condition (1) can be satisfied by a store barrier after the commit
+ * record of X is flushed, because each WAL record of X is already
+ * flushed at the end of its insertion. Condition (2) can be satisfied
+ * by waiting for any record insertions that have a smaller LSN than
+ * the commit record just inserted by X, plus the same store barrier.
+ *
+ * Now is the time. Have a store barrier.
+ */
+ nv_drain();
+
+ /*
+ * Remember where the last persistent record is. A new value should
+ * not fall behind the old one.
+ */
+ SpinLockAcquire(&XLogCtl->info_lck);
+ if (XLogCtl->persistentUpTo < record)
+ XLogCtl->persistentUpTo = record;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ /*
+ * The records up to the returned "record" are now persistent on
+ * NVWAL. Now signal walsenders.
+ */
+ WalSndWakeupRequest();
+ WalSndWakeupProcessRequests();
+
+ return;
+ }
+
/* Quick exit if already known flushed */
if (record <= LogwrtResult.Flush)
return;
@@ -3049,6 +3284,13 @@ XLogBackgroundFlush(void)
if (RecoveryInProgress())
return false;
+ /*
+ * Quick exit if NVWAL buffer is used and archiving is not active. In this
+ * case, we need no WAL segment files in the pg_wal directory.
+ */
+ if (NvwalAvail && !XLogArchivingActive())
+ return false;
+
/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
@@ -3067,6 +3309,18 @@ XLogBackgroundFlush(void)
flexible = false; /* ensure it all gets written */
}
+ /*
+ * If NVWAL is used, back off to the last completed segment boundary
+ * so that buffer pages are written to files segment by segment. We do
+ * so only here, after XLogCtl->asyncXactLSN is loaded, because it
+ * should be taken into account.
+ */
+ if (NvwalAvail)
+ {
+ WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+ flexible = false; /* ensure it all gets written */
+ }
+
/*
* If already known flushed, we're done. Just need to check if we are
* holding an open file handle to a logfile that's no longer in use,
@@ -3093,7 +3347,12 @@ XLogBackgroundFlush(void)
flushbytes =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
- if (WalWriterFlushAfter == 0 || lastflush == 0)
+ if (NvwalAvail)
+ {
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else if (WalWriterFlushAfter == 0 || lastflush == 0)
{
/* first call, or block based limits disabled */
WriteRqst.Flush = WriteRqst.Write;
@@ -3152,7 +3411,28 @@ XLogBackgroundFlush(void)
* Great, done. To take some work off the critical path, try to initialize
* as many of the no-longer-needed WAL buffers for future use as we can.
*/
- AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+ if (NvwalAvail && max_wal_senders == 0)
+ {
+ XLogRecPtr upto;
+
+ /*
+ * If NVWAL is used and there is no walsender, nobody will load
+ * segments from the buffer. So let's recycle segments up to {where
+ * we have requested to write and flush} + NvwalSize.
+ *
+ * Note that if NVWAL is used and a walsender seems to be running, we
+ * must do nothing; keep the written pages on the buffer so walsenders
+ * load them from the buffer, not from the segment files. The buffer
+ * pages will eventually be recycled by checkpoint.
+ */
+ Assert(WriteRqst.Write == WriteRqst.Flush);
+ Assert(WriteRqst.Write % wal_segment_size == 0);
+
+ upto = WriteRqst.Write + NvwalSize;
+ AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+ }
+ else
+ AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
/*
* If we determined that we need to write data, but somebody else
@@ -3885,6 +4165,43 @@ XLogFileClose(void)
ReleaseExternalFD();
}
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+ XLogRecPtr newupto,
+ InitializedUpTo;
+
+ Assert(NvwalAvail);
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ newupto = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ InitializedUpTo = XLogCtl->InitializedUpTo;
+
+ newupto += NvwalSize;
+ Assert(newupto % wal_segment_size == 0);
+
+ if (newupto <= InitializedUpTo)
+ return;
+
+ /*
+ * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+ * handles the first argument as the beginning of pages, not the end.
+ */
+ AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
/*
* Preallocate log files beyond the specified log endpoint.
*
@@ -4181,8 +4498,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
* Before deleting the file, see if it can be recycled as a future log
* segment. Only recycle normal files, pg_standby for example can create
* symbolic links pointing to a separate archive directory.
+ *
+ * If NVWAL buffer is used, a log segment file is never recycled
+ * (that is, we always go into the else block).
*/
- if (wal_recycle &&
+ if (!NvwalAvail && wal_recycle &&
endlogSegNo <= recycleSegNo &&
lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
@@ -4600,6 +4920,7 @@ InitControlFile(uint64 sysidentifier)
memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
ControlFile->state = DB_SHUTDOWNED;
ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+ ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
/* Set important parameter values for use when replaying WAL */
ControlFile->MaxConnections = MaxConnections;
@@ -5430,41 +5751,58 @@ BootStrapXLOG(void)
record->xl_crc = crc;
/* Create first XLOG segment file */
- use_existent = false;
- openLogFile = XLogFileInit(1, &use_existent, false);
+ if (NvwalAvail)
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+ pgstat_report_wait_end();
- /*
- * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
- * close the file again in a moment.
- */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ nv_drain();
+ pgstat_report_wait_end();
- /* Write the first page with the initial record */
- errno = 0;
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
- if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
- {
- /* if write didn't set errno, assume problem is no disk space */
- if (errno == 0)
- errno = ENOSPC;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not write bootstrap write-ahead log file: %m")));
+ /*
+ * Other WAL structures will be initialized in the startup process.
+ */
}
- pgstat_report_wait_end();
+ else
+ {
+ use_existent = false;
+ openLogFile = XLogFileInit(1, &use_existent, false);
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
- if (pg_fsync(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync bootstrap write-ahead log file: %m")));
- pgstat_report_wait_end();
+ /*
+ * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+ * close the file again in a moment.
+ */
- if (close(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not close bootstrap write-ahead log file: %m")));
+ /* Write the first page with the initial record */
+ errno = 0;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write bootstrap write-ahead log file: %m")));
+ }
+ pgstat_report_wait_end();
- openLogFile = -1;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ if (pg_fsync(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync bootstrap write-ahead log file: %m")));
+ pgstat_report_wait_end();
+
+ if (close(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not close bootstrap write-ahead log file: %m")));
+
+ openLogFile = -1;
+ }
/* Now create pg_control */
InitControlFile(sysidentifier);
@@ -5718,41 +6056,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
* happens in the middle of a segment, copy data from the last WAL segment
* of the old timeline up to the switch point, to the starting WAL segment
* on the new timeline.
+ *
+ * If non-volatile WAL buffer is used, no new segment file is created. Data
+ * up to the switch point will be copied into the NVWAL buffer by StartupXLOG().
*/
- if (endLogSegNo == startLogSegNo)
- {
- /*
- * Make a copy of the file on the new timeline.
- *
- * Writing WAL isn't allowed yet, so there are no locking
- * considerations. But we should be just as tense as XLogFileInit to
- * avoid emplacing a bogus file.
- */
- XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
- XLogSegmentOffset(endOfLog, wal_segment_size));
- }
- else
+ if (!NvwalAvail)
{
- /*
- * The switch happened at a segment boundary, so just create the next
- * segment on the new timeline.
- */
- bool use_existent = true;
- int fd;
+ if (endLogSegNo == startLogSegNo)
+ {
+ /*
+ * Make a copy of the file on the new timeline.
+ *
+ * Writing WAL isn't allowed yet, so there are no locking
+ * considerations. But we should be just as tense as XLogFileInit to
+ * avoid emplacing a bogus file.
+ */
+ XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+ XLogSegmentOffset(endOfLog, wal_segment_size));
+ }
+ else
+ {
+ /*
+ * The switch happened at a segment boundary, so just create the next
+ * segment on the new timeline.
+ */
+ bool use_existent = true;
+ int fd;
- fd = XLogFileInit(startLogSegNo, &use_existent, true);
+ fd = XLogFileInit(startLogSegNo, &use_existent, true);
- if (close(fd) != 0)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
+ if (close(fd) != 0)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
- XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
- wal_segment_size);
- errno = save_errno;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not close file \"%s\": %m", xlogfname)));
+ XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+ wal_segment_size);
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m", xlogfname)));
+ }
}
}
@@ -7009,6 +7353,11 @@ StartupXLOG(void)
InRecovery = true;
}
+ /* Dump discardedUpTo just before REDO */
+ elog(LOG, "ControlFile->discardedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
+
/* REDO */
if (InRecovery)
{
@@ -7795,10 +8144,88 @@ StartupXLOG(void)
Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+ if (NvwalAvail)
+ {
+ XLogRecPtr discardedUpTo;
+
+ discardedUpTo = ControlFile->discardedUpTo;
+ Assert(discardedUpTo == InvalidXLogRecPtr ||
+ discardedUpTo % wal_segment_size == 0);
+
+ if (discardedUpTo == InvalidXLogRecPtr)
+ {
+ elog(DEBUG1, "brand-new NVWAL");
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else if (EndOfLog <= discardedUpTo)
+ {
+ elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = InvalidXLogRecPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else
+ {
+ int last_idx;
+ int idx;
+ XLogRecPtr ptr;
+
+ elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+ /*
+ * Initialize the xlblocks array because we decided to keep UNDONE
+ * records on the NVWAL buffer; otherwise each buffer page with
+ * xlblocks == 0 (initialized so by XLOGShmemInit) would be
+ * accidentally cleared by the following AdvanceXLInsertBuffer!
+ *
+ * Two cases can be considered:
+ *
+ * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+ * Initialize up to (and including) the page containing the last
+ * record. That page should end with EndOfLog. The next
+ * page "N", beginning with EndOfLog, is left untouched
+ * because, in the corner case that all the NVWAL buffer
+ * pages are already filled, page N is at the same
+ * location as the first page "F" beginning with discardedUpTo.
+ * Of course we should not overwrite page F.
+ *
+ * In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+ * last_idx, indicating page N. Then we go forward from
+ * page F up to (but excluding) page N, which has the same
+ * buffer index as page F.
+ *
+ * 2) EndOfLog is not on a page boundary: Initialize all the pages
+ * except the page "L" containing the last record. Page L will be
+ * initialized by the following "Tricky point", including its
+ * content.
+ *
+ * In either case, XLogCtl->InitializedUpTo is to be initialized in
+ * the following "Tricky" if-else block.
+ */
+
+ last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+ ptr = discardedUpTo;
+ for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+ idx = NextBufIdx(idx))
+ {
+ ptr += XLOG_BLCKSZ;
+ XLogCtl->xlblocks[idx] = ptr;
+ }
+ }
+ }
+
/*
- * Tricky point here: readBuf contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
+ * Tricky point here: readBuf contains the *last* block that the
+ * LastRec record spans, not the one it starts in. The last block is
+ * indeed the one we want to use.
*/
if (EndOfLog % XLOG_BLCKSZ != 0)
{
@@ -7818,6 +8245,9 @@ StartupXLOG(void)
memcpy(page, xlogreader->readBuf, len);
memset(page + len, 0, XLOG_BLCKSZ - len);
+ if (NvwalAvail)
+ nv_persist(page, XLOG_BLCKSZ);
+
XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
}
@@ -7831,12 +8261,54 @@ StartupXLOG(void)
XLogCtl->InitializedUpTo = EndOfLog;
}
- LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+ if (NvwalAvail)
+ {
+ XLogRecPtr SegBeginPtr;
+
+ /*
+ * If NVWAL buffer is used, writing records out to segment files should
+ * be done segment by segment. So Logwrt{Rqst,Result} (and also
+ * discardedUpTo) should be a multiple of wal_segment_size. Let's
+ * back them off to the last segment boundary.
+ */
+
+ SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+ /*
+ * persistentUpTo does not need to be a multiple of wal_segment_size;
+ * it should be the drained-up-to LSN. walsender will use it to load
+ * records from the NVWAL buffer.
+ */
+ XLogCtl->persistentUpTo = EndOfLog;
+
+ /* Update discardedUpTo in pg_control if still invalid */
+ if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+ {
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
+ elog(DEBUG1, "EndOfLog: %X/%X",
+ (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
- XLogCtl->LogwrtResult = LogwrtResult;
+ elog(DEBUG1, "SegBeginPtr: %X/%X",
+ (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+ }
+ else
+ {
+ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
- XLogCtl->LogwrtRqst.Write = EndOfLog;
- XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ XLogCtl->LogwrtRqst.Write = EndOfLog;
+ XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ }
/*
* Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7967,6 +8439,7 @@ StartupXLOG(void)
char origpath[MAXPGPATH];
char partialfname[MAXFNAMELEN];
char partialpath[MAXPGPATH];
+ XLogRecPtr discardedUpTo;
XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7978,6 +8451,53 @@ StartupXLOG(void)
*/
XLogArchiveCleanup(partialfname);
+ /*
+ * If NVWAL is also used for archival recovery, write old
+ * records out to segment files to archive them. Note that we
+ * need WAL-related locks because LocalXLogInsertAllowed has
+ * already been set to -1.
+ */
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo < EndOfLog)
+ {
+ XLogwrtRqst WriteRqst;
+ TimeLineID thisTLI = ThisTimeLineID;
+ XLogRecPtr SegBeginPtr =
+ EndOfLog - (EndOfLog % wal_segment_size);
+
+ /*
+ * XXX Assume that all the records have the same TLI.
+ */
+ ThisTimeLineID = EndOfLogTLI;
+
+ WriteRqst.Write = EndOfLog;
+ WriteRqst.Flush = 0;
+
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ XLogWrite(WriteRqst, false);
+
+ /*
+ * Force back-off to the last segment boundary.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ LWLockRelease(WALWriteLock);
+
+ ThisTimeLineID = thisTLI;
+ }
+
durable_rename(origpath, partialpath, ERROR);
XLogArchiveNotify(partialfname);
}
@@ -7987,7 +8507,10 @@ StartupXLOG(void)
/*
* Preallocate additional log files, if wanted.
*/
- PreallocXlogFiles(EndOfLog);
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(EndOfLog);
/*
* Okay, we're officially UP.
@@ -8550,10 +9073,24 @@ GetInsertRecPtr(void)
/*
* GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
* position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
*/
XLogRecPtr
GetFlushRecPtr(void)
{
+ if (NvwalAvail)
+ {
+ XLogRecPtr ret;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ LogwrtResult = XLogCtl->LogwrtResult;
+ ret = XLogCtl->persistentUpTo;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ return ret;
+ }
+
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
@@ -8853,6 +9390,9 @@ CreateCheckPoint(int flags)
VirtualTransactionId *vxids;
int nvxids;
+ /* for non-volatile WAL buffer */
+ XLogRecPtr newDiscardedUpTo = 0;
+
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
@@ -9164,6 +9704,22 @@ CreateCheckPoint(int flags)
*/
PriorRedoPtr = ControlFile->checkPointCopy.redo;
+ /*
+ * If the non-volatile WAL buffer is used, discardedUpTo should be
+ * updated and persisted in the control file, so the new value is
+ * calculated here.
+ *
+ * TODO Do not copy and paste codes...
+ */
+ if (NvwalAvail)
+ {
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ _logSegNo--;
+
+ newDiscardedUpTo = _logSegNo * wal_segment_size;
+ }
+
/*
* Update the control file.
*/
@@ -9172,6 +9728,16 @@ CreateCheckPoint(int flags)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
+ if (NvwalAvail)
+ {
+ /*
+ * A new value should not fall behind the old one.
+ */
+ if (ControlFile->discardedUpTo < newDiscardedUpTo)
+ ControlFile->discardedUpTo = newDiscardedUpTo;
+ else
+ newDiscardedUpTo = ControlFile->discardedUpTo;
+ }
ControlFile->time = (pg_time_t) time(NULL);
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9189,6 +9755,44 @@ CreateCheckPoint(int flags)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+ * so that the XLOG records older than newDiscardedUpTo are treated as
+ * "already written and flushed."
+ */
+ if (NvwalAvail)
+ {
+ Assert(newDiscardedUpTo > 0);
+
+ /* Update process-local variables */
+ LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+ /*
+ * Update shared-memory variables. We need both light-weight lock and
+ * spin lock to update them.
+ */
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&XLogCtl->info_lck);
+
+ /*
+ * Note that there is a corner case in which the process-local
+ * LogwrtResult falls behind the shared XLogCtl->LogwrtResult, if the
+ * whole non-volatile XLOG buffer is filled and some pages are written
+ * out to segment files between UpdateControlFile and LWLockAcquire
+ * above.
+ *
+ * TODO For now, we ignore that case because it can hardly occur.
+ */
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+ if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+ SpinLockRelease(&XLogCtl->info_lck);
+ LWLockRelease(WALWriteLock);
+ }
+
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptFullXid = checkPoint.nextFullXid;
@@ -9212,22 +9816,48 @@ CreateCheckPoint(int flags)
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
- /*
- * Delete old log files, those no longer needed for last checkpoint to
- * prevent the disk holding the xlog from growing full.
- */
- XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
- KeepLogSeg(recptr, &_logSegNo);
- InvalidateObsoleteReplicationSlots(_logSegNo);
- _logSegNo--;
- RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ if (NvwalAvail)
+ {
+ /*
+ * We already set _logSegNo to the value equivalent to discardedUpTo.
+ * We first increment it to call InvalidateObsoleteReplicationSlots.
+ */
+ _logSegNo++;
+ InvalidateObsoleteReplicationSlots(_logSegNo);
+
+ /*
+ * Then we decrement _logSegNo again to remove WAL segment files
+ * that have spilled out of the non-volatile WAL buffer.
+ *
+ * Note that you should set wal_recycle to off to remove segment files.
+ */
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ }
+ else
+ {
+ /*
+ * Delete old log files, those no longer needed for last checkpoint to
+ * prevent the disk holding the xlog from growing full.
+ */
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ InvalidateObsoleteReplicationSlots(_logSegNo);
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ }
/*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
- PreallocXlogFiles(recptr);
+ {
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(recptr);
+ }
/*
* Truncate pg_subtrans if possible. We can throw away all data before
@@ -11971,6 +12601,170 @@ CancelBackup(void)
}
}
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+ return NvwalAvail;
+}
+
+/*
+ * Returns the number of bytes we can load from NVWAL and sets *nvwalptr
+ * to the LSN to start loading from.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+ XLogRecPtr readUpTo;
+ XLogRecPtr discardedUpTo;
+
+ Assert(IsNvwalAvail());
+ Assert(nvwalptr != NULL);
+
+ readUpTo = target + count;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check if all the records are on WAL segment files */
+ if (readUpTo <= discardedUpTo)
+ return 0;
+
+ /* Check if all the records are on NVWAL */
+ if (discardedUpTo <= target)
+ {
+ *nvwalptr = target;
+ return count;
+ }
+
+ /* Some on WAL segment files, some on NVWAL */
+ *nvwalptr = discardedUpTo;
+ return (Size) (readUpTo - discardedUpTo);
+}
+
+/*
+ * Like WALRead in xlogreader.c, but loads from the non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ Assert(NvwalAvail);
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ /*
+ * Hold WALBufMappingLock in shared mode to prevent others from
+ * rotating the WAL buffer while we copy WAL records from it. We do
+ * not need an exclusive lock because we do not rotate the buffer in
+ * this function.
+ */
+ LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+ while (nbytes > 0)
+ {
+ char *q;
+ Size off;
+ Size max_copy;
+ Size copybytes;
+ XLogRecPtr discardedUpTo;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check whether the records we need have already been evicted */
+ if (recptr < discardedUpTo)
+ {
+ LWLockRelease(WALBufMappingLock);
+
+ /* TODO error handling? */
+ return false;
+ }
+
+ /*
+ * Get the target address on the non-volatile WAL buffer and the size
+ * we can copy from it at once. The buffer can wrap around, so we
+ * might have to split the copy into two or more chunks.
+ */
+ off = recptr % NvwalSize;
+ q = XLogCtl->pages + off;
+ max_copy = NvwalSize - off;
+ copybytes = Min(nbytes, max_copy);
+
+ memcpy(p, q, copybytes);
+
+ /* Update state for copy */
+ recptr += copybytes;
+ nbytes -= copybytes;
+ p += copybytes;
+ }
+
+ LWLockRelease(WALBufMappingLock);
+ return true;
+}
+
+static bool
+IsXLogSourceFromStream(XLogSource source)
+{
+ switch (source)
+ {
+ case XLOG_FROM_STREAM:
+ case XLOG_FROM_STREAM_NVWAL:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+static bool
+IsXLogSourceFromNvwal(XLogSource source)
+{
+ switch (source)
+ {
+ case XLOG_FROM_NVWAL:
+ case XLOG_FROM_STREAM_NVWAL:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+static bool
+NeedsForMoreXLog(XLogRecPtr targetChunkEndPtr)
+{
+ switch (readSource)
+ {
+ case XLOG_FROM_ARCHIVE:
+ case XLOG_FROM_PG_WAL:
+ return (readFile < 0);
+
+ case XLOG_FROM_NVWAL:
+ Assert(NvwalAvail);
+ return false;
+
+ case XLOG_FROM_STREAM:
+ return (flushedUpto < targetChunkEndPtr);
+
+ case XLOG_FROM_STREAM_NVWAL:
+ Assert(NvwalAvail);
+ return (flushedUpto < targetChunkEndPtr);
+
+ default: /* XLOG_FROM_ANY */
+ Assert(readFile < 0);
+ return true;
+ }
+}
+
/*
* Read the XLOG page containing RecPtr into readBuf (if not read already).
* Returns number of bytes read, if the page is read successfully, or -1
@@ -12012,7 +12806,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
- if (readFile >= 0 &&
+ if ((readFile >= 0 || IsXLogSourceFromNvwal(readSource)) &&
!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
{
/*
@@ -12029,7 +12823,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
}
}
- close(readFile);
+ if (readFile >= 0)
+ close(readFile);
readFile = -1;
readSource = XLOG_FROM_ANY;
}
@@ -12038,9 +12833,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
retry:
/* See if we need to retrieve more data */
- if (readFile < 0 ||
- (readSource == XLOG_FROM_STREAM &&
- flushedUpto < targetPagePtr + reqLen))
+ if (NeedsForMoreXLog(targetPagePtr + reqLen))
{
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
private->randAccess,
@@ -12061,7 +12854,7 @@ retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
- Assert(readFile != -1);
+ Assert(readFile != -1 || IsXLogSourceFromNvwal(readSource));
/*
* If the current segment is being streamed from master, calculate how
@@ -12069,7 +12862,7 @@ retry:
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
- if (readSource == XLOG_FROM_STREAM)
+ if (IsXLogSourceFromStream(readSource))
{
if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
readLen = XLOG_BLCKSZ;
@@ -12080,41 +12873,59 @@ retry:
else
readLen = XLOG_BLCKSZ;
- /* Read the requested page */
readOff = targetPageOff;
- pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
- r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
- if (r != XLOG_BLCKSZ)
+ if (IsXLogSourceFromNvwal(readSource))
{
- char fname[MAXFNAMELEN];
- int save_errno = errno;
+ Size offset = (Size) (targetPagePtr % NvwalSize);
+ char *readpos = XLogCtl->pages + offset;
+
+ Assert(offset % XLOG_BLCKSZ == 0);
+ /* Load the requested page from non-volatile WAL buffer */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ memcpy(readBuf, readpos, readLen);
pgstat_report_wait_end();
- XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
- if (r < 0)
+
+ /* There are not any other clues of TLI... */
+ xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+ }
+ else
+ {
+ /* Read the requested page from file */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+ if (r != XLOG_BLCKSZ)
{
- errno = save_errno;
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
- fname, readOff)));
+ char fname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ pgstat_report_wait_end();
+ XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+ if (r < 0)
+ {
+ errno = save_errno;
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode_for_file_access(),
+ errmsg("could not read from log segment %s, offset %u: %m",
+ fname, readOff)));
+ }
+ else
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+ fname, readOff, r, (Size) XLOG_BLCKSZ)));
+ goto next_record_is_invalid;
}
- else
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("could not read from log segment %s, offset %u: read %d of %zu",
- fname, readOff, r, (Size) XLOG_BLCKSZ)));
- goto next_record_is_invalid;
+ pgstat_report_wait_end();
+
+ xlogreader->seg.ws_tli = curFileTLI;
}
- pgstat_report_wait_end();
Assert(targetSegNo == readSegNo);
Assert(targetPageOff == readOff);
Assert(reqLen <= readLen);
- xlogreader->seg.ws_tli = curFileTLI;
-
/*
* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -12148,6 +12959,17 @@ retry:
goto next_record_is_invalid;
}
+ /*
+ * Update curFileTLI on each verified page if the non-volatile WAL
+ * buffer is used, because the NVWAL's filename carries no TimeLineID
+ * information.
+ */
+ if (IsXLogSourceFromNvwal(readSource) &&
+ curFileTLI != xlogreader->latestPageTLI)
+ {
+ curFileTLI = xlogreader->latestPageTLI;
+ elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+ }
+
return readLen;
next_record_is_invalid:
@@ -12229,7 +13051,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (!InArchiveRecovery)
currentSource = XLOG_FROM_PG_WAL;
else if (currentSource == XLOG_FROM_ANY ||
- (!StandbyMode && currentSource == XLOG_FROM_STREAM))
+ (!StandbyMode && IsXLogSourceFromStream(currentSource)))
{
lastSourceFailed = false;
currentSource = XLOG_FROM_ARCHIVE;
@@ -12252,6 +13074,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
+ case XLOG_FROM_NVWAL:
/*
* Check to see if the trigger file exists. Note that we
@@ -12265,6 +13088,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return false;
}
+ /* Try NVWAL if available */
+ if (NvwalAvail && currentSource != XLOG_FROM_NVWAL)
+ {
+ currentSource = XLOG_FROM_NVWAL;
+ break;
+ }
+
/*
* Not in standby mode, and we've now tried the archive
* and pg_wal.
@@ -12276,11 +13106,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Move to XLOG_FROM_STREAM state, and set to start a
* walreceiver if necessary.
*/
- currentSource = XLOG_FROM_STREAM;
+ if (currentSource == XLOG_FROM_NVWAL)
+ currentSource = XLOG_FROM_STREAM_NVWAL;
+ else
+ currentSource = XLOG_FROM_STREAM;
startWalReceiver = true;
break;
case XLOG_FROM_STREAM:
+ case XLOG_FROM_STREAM_NVWAL:
/*
* Failure while streaming. Most likely, we got here
@@ -12386,6 +13220,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
+ case XLOG_FROM_NVWAL:
/*
* WAL receiver must not be running when reading WAL from
@@ -12403,6 +13238,59 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* Try to load from NVWAL */
+ if (currentSource == XLOG_FROM_NVWAL)
+ {
+ XLogRecPtr discardedUpTo;
+
+ Assert(NvwalAvail);
+
+ /*
+ * Check if the target page exists on NVWAL. Note that
+ * RecPtr points to the end of the target chunk.
+ *
+ * TODO need ControlFileLock?
+ */
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo < RecPtr &&
+ RecPtr <= discardedUpTo + NvwalSize)
+ {
+ /* Report recovery progress in PS display */
+ set_ps_display("recovering NVWAL");
+ elog(DEBUG1, "recovering NVWAL");
+
+ /* Track source of data and receipt time */
+ readSource = XLOG_FROM_NVWAL;
+ XLogReceiptSource = XLOG_FROM_NVWAL;
+ XLogReceiptTime = GetCurrentTimestamp();
+
+ /*
+ * Construct expectedTLEs. This is necessary to
+ * recover from NVWAL alone because its filename
+ * does not carry any TLI information.
+ */
+ if (!expectedTLEs)
+ {
+ TimeLineHistoryEntry *entry;
+
+ entry = palloc(sizeof(TimeLineHistoryEntry));
+ entry->tli = recoveryTargetTLI;
+ entry->begin = entry->end = InvalidXLogRecPtr;
+
+ expectedTLEs = list_make1(entry);
+ elog(DEBUG1, "expectedTLEs: [%u]",
+ (uint32) recoveryTargetTLI);
+ }
+
+ return true;
+ }
+
+ /* Target page does not exist on NVWAL */
+ lastSourceFailed = true;
+ break;
+ }
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
@@ -12420,6 +13308,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
break;
case XLOG_FROM_STREAM:
+ case XLOG_FROM_STREAM_NVWAL:
{
bool havedata;
@@ -12544,21 +13433,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* info is set correctly and XLogReceiptTime isn't
* changed.
*/
- if (readFile < 0)
+ if (currentSource == XLOG_FROM_STREAM_NVWAL)
{
if (!expectedTLEs)
expectedTLEs = readTimeLineHistory(receiveTLI);
- readFile = XLogFileRead(readSegNo, PANIC,
- receiveTLI,
- XLOG_FROM_STREAM, false);
- Assert(readFile >= 0);
+
+ /* TODO is it ok to return, not to break switch? */
+ readSource = XLOG_FROM_STREAM_NVWAL;
+ XLogReceiptSource = XLOG_FROM_STREAM_NVWAL;
+ return true;
}
else
{
- /* just make sure source info is correct... */
- readSource = XLOG_FROM_STREAM;
- XLogReceiptSource = XLOG_FROM_STREAM;
- return true;
+ if (readFile < 0)
+ {
+ if (!expectedTLEs)
+ expectedTLEs = readTimeLineHistory(receiveTLI);
+ readFile = XLogFileRead(readSegNo, PANIC,
+ receiveTLI,
+ XLOG_FROM_STREAM, false);
+ Assert(readFile >= 0);
+ }
+ else
+ {
+ /* just make sure source info is correct... */
+ readSource = XLOG_FROM_STREAM;
+ XLogReceiptSource = XLOG_FROM_STREAM;
+ return true;
+ }
}
break;
}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..77f629fda2 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1066,11 +1066,24 @@ WALRead(XLogReaderState *state,
char *p;
XLogRecPtr recptr;
Size nbytes;
+#ifndef FRONTEND
+ XLogRecPtr recptr_nvwal = 0;
+ Size nbytes_nvwal = 0;
+#endif
p = buf;
recptr = startptr;
nbytes = count;
+#ifndef FRONTEND
+ /* Try to load records directly from NVWAL if used */
+ if (IsNvwalAvail())
+ {
+ nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+ nbytes = count - nbytes_nvwal;
+ }
+#endif
+
while (nbytes > 0)
{
uint32 startoff;
@@ -1138,6 +1151,17 @@ WALRead(XLogReaderState *state,
p += readbytes;
}
+#ifndef FRONTEND
+ if (IsNvwalAvail())
+ {
+ if (!CopyXLogRecordsFromNVWAL(p, nbytes_nvwal, recptr_nvwal))
+ {
+ /* TODO graceful error handling */
+ elog(PANIC, "some records on NVWAL had been discarded");
+ }
+ }
+#endif
+
return true;
}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..4c594e915f 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
ControlFile->checkPointCopy.oldestCommitTsXid);
printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
ControlFile->checkPointCopy.newestCommitTsXid);
+ printf(_("NVWAL discarded up to: %X/%X\n"),
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
printf(_("Time of latest checkpoint: %s\n"),
ckpttime_str);
printf(_("Fake LSN counter for unlogged rels: %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a05e79524..75433a6dc0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -351,6 +351,14 @@ extern void XLogRequestWalReceiverReply(void);
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
+extern bool IsNvwalAvail(void);
+extern Size GetLoadableSizeFromNvwal(XLogRecPtr target,
+ Size count,
+ XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+ Size count,
+ XLogRecPtr startptr);
+
/*
* Routines to start, stop, and get status of a base backup.
*/
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..fe71992a69 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
/* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION 1300
+#define PG_CONTROL_VERSION 1301
/* Nonce key length, see below */
#define MOCK_AUTH_NONCE_LEN 32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels */
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+ * checkpoint or a restartpoint completes successfully, or when the
+ * whole NVWAL is filled with WAL records and a new record is being
+ * inserted.
+ * This field tells that the NVWAL contains WAL records in the range of
+ * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+ * Note that WAL records whose LSNs are less than discardedUpTo remain
+ * in WAL segment files and may be needed for recovery.
+ *
+ * It is set to zero when NVWAL is not used.
+ */
+ XLogRecPtr discardedUpTo;
+
/*
* These two values determine the minimum point we must recover up to
* before starting up:
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..bbd47e1663 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -142,14 +142,17 @@ HINT: No function matches the given name and argument types. You might need to
select setting as segsize
from pg_settings where name = 'wal_segment_size'
\gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
ok
----
t
(1 row)
-- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
ok
----
t
@@ -161,14 +164,15 @@ select * from pg_ls_waldir() limit 0;
------+------+--------------
(0 rows)
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
ok
----
t
(1 row)
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+ (select * from pg_ls_waldir() w
+ where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
ok
----
t
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 094e8f8296..09c326775d 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -39,15 +39,19 @@ SELECT num_nulls();
select setting as segsize
from pg_settings where name = 'wal_segment_size'
\gset
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
-- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
-- Test not-run-to-completion cases.
select * from pg_ls_waldir() limit 0;
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+ (select * from pg_ls_waldir() w
+ where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
select count(*) >= 0 as ok from pg_ls_archive_statusdir();
--
2.17.1
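As an aside, the three-way split that GetLoadableSizeFromNvwal performs in the patch above (all records already discarded to segment files, all records still on NVWAL, or a read that straddles the discardedUpTo boundary) can be modeled stand-alone. This is a minimal sketch under stated assumptions: the type aliases and the helper name are illustrative, not part of the patch, and locking is omitted.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Hypothetical stand-alone model of the split logic: given a read of
 * `count` bytes starting at `target`, and the boundary `discardedUpTo`
 * below which records live only in WAL segment files, return how many
 * trailing bytes can come from the NVWAL and set *nvwalptr to the LSN
 * to start loading from. */
static size_t
loadable_from_nvwal(XLogRecPtr target, size_t count,
                    XLogRecPtr discardedUpTo, XLogRecPtr *nvwalptr)
{
    XLogRecPtr readUpTo = target + count;

    if (readUpTo <= discardedUpTo)
        return 0;               /* everything is on segment files */

    if (discardedUpTo <= target)
    {
        *nvwalptr = target;     /* everything is on the NVWAL */
        return count;
    }

    /* Straddles the boundary: head from files, tail from NVWAL. */
    *nvwalptr = discardedUpTo;
    return (size_t) (readUpTo - discardedUpTo);
}
```

WALRead then reads the first `count - loadable` bytes from segment files as usual and copies the rest from the NVWAL, which matches the `#ifndef FRONTEND` hunk in xlogreader.c above.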
Attachment: v3-0003-walreceiver-supports-non-volatile-WAL-buffer.patch
From e3a4da834a79770c63c26c9859dc179911a37540 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:58 +0900
Subject: [PATCH v3 3/5] walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly into the
non-volatile WAL buffer if applicable.
---
src/backend/access/transam/xlog.c | 31 +++++++++++++++-
src/backend/replication/walreceiver.c | 53 ++++++++++++++++++++++++++-
src/include/access/xlog.h | 4 ++
3 files changed, 85 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 45e05b9498..2a022be36a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -925,6 +925,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
XLogSource source, bool notfoundOk);
static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+static bool CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr,
+ bool store);
static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -12650,6 +12652,21 @@ GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
*/
bool
CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+ return CopyXLogRecordsOnNVWAL(buf, count, startptr, false);
+}
+
+/*
+ * Called by walreceiver.
+ */
+bool
+CopyXLogRecordsToNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+ return CopyXLogRecordsOnNVWAL(buf, count, startptr, true);
+}
+
+static bool
+CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr, bool store)
{
char *p;
XLogRecPtr recptr;
@@ -12699,7 +12716,13 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
max_copy = NvwalSize - off;
copybytes = Min(nbytes, max_copy);
- memcpy(p, q, copybytes);
+ if (store)
+ {
+ memcpy(q, p, copybytes);
+ nv_flush(q, copybytes);
+ }
+ else
+ memcpy(p, q, copybytes);
/* Update state for copy */
recptr += copybytes;
@@ -12711,6 +12734,12 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
return true;
}
+void
+SyncNVWAL(void)
+{
+ nv_drain();
+}
+
static bool
IsXLogSourceFromStream(XLogSource source)
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..20922ed230 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -130,6 +130,7 @@ static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *start
static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+static void XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
static void XLogWalRcvSendReply(bool force, bool requestReply);
static void XLogWalRcvSendHSFeedback(bool immed);
@@ -856,7 +857,10 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
buf += hdrlen;
len -= hdrlen;
- XLogWalRcvWrite(buf, len, dataStart);
+ if (IsNvwalAvail())
+ XLogWalRcvStore(buf, len, dataStart);
+ else
+ XLogWalRcvWrite(buf, len, dataStart);
break;
}
case 'k': /* Keepalive */
@@ -991,6 +995,42 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
+/*
+ * Like XLogWalRcvWrite, but stores into the non-volatile WAL buffer.
+ */
+static void
+XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr)
+{
+ Assert(IsNvwalAvail());
+
+ CopyXLogRecordsToNVWAL(buf, nbytes, recptr);
+
+ /*
+ * Also write out to file if we have to archive segments.
+ *
+ * We could do this segment by segment but we reuse existing method to
+ * do it record by record because the former gives us more complexity
+ * (locking WalBufMappingLock, getting the address of the segment on
+ * non-volatile WAL buffer, etc).
+ */
+ if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+ XLogWalRcvWrite(buf, nbytes, recptr);
+ else
+ {
+ /*
+ * Update status as like XLogWalRcvWrite does.
+ */
+
+ /* Update process-local status */
+ XLByteToSeg(recptr + nbytes, recvSegNo, wal_segment_size);
+ recvFileTLI = ThisTimeLineID;
+ LogstreamResult.Write = recptr + nbytes;
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ }
+}
+
/*
* Flush the log to disk.
*
@@ -1004,7 +1044,16 @@ XLogWalRcvFlush(bool dying)
{
WalRcvData *walrcv = WalRcv;
- issue_xlog_fsync(recvFile, recvSegNo);
+ /*
+ * We should call both SyncNVWAL and issue_xlog_fsync if we use both
+ * NVWAL and WAL archiving, so there are two separate if statements
+ * below rather than a single if-else.
+ */
+ if (IsNvwalAvail())
+ SyncNVWAL();
+
+ if (recvFile >= 0)
+ issue_xlog_fsync(recvFile, recvSegNo);
LogstreamResult.Flush = LogstreamResult.Write;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75433a6dc0..e6ca151271 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -358,6 +358,10 @@ extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
extern bool CopyXLogRecordsFromNVWAL(char *buf,
Size count,
XLogRecPtr startptr);
+extern bool CopyXLogRecordsToNVWAL(char *buf,
+ Size count,
+ XLogRecPtr startptr);
+extern void SyncNVWAL(void);
/*
* Routines to start, stop, and get status of a base backup.
--
2.17.1
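The wraparound copy loop shared by CopyXLogRecordsFromNVWAL and CopyXLogRecordsToNVWAL in this patch maps an LSN to offset `recptr % NvwalSize` in the ring and bounds each memcpy by the distance to the end of the buffer. A minimal stand-alone sketch of that loop, with an illustrative tiny buffer size and no locking or persistence calls (both assumptions, not from the patch):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;

#define MODEL_NVWAL_SIZE 16     /* tiny ring size for illustration only */

/* Hypothetical model of the wraparound copy: a copy that crosses the
 * end of the ring is split into two or more chunks, each bounded by
 * MODEL_NVWAL_SIZE - off bytes. */
static void
ring_copy(char *dst, const char *ring, XLogRecPtr startptr, size_t count)
{
    XLogRecPtr recptr = startptr;
    size_t nbytes = count;
    char *p = dst;

    while (nbytes > 0)
    {
        size_t off = (size_t) (recptr % MODEL_NVWAL_SIZE);
        size_t max_copy = MODEL_NVWAL_SIZE - off;
        size_t copybytes = nbytes < max_copy ? nbytes : max_copy;

        memcpy(p, ring + off, copybytes);

        recptr += copybytes;
        nbytes -= copybytes;
        p += copybytes;
    }
}
```

In the store direction the patch swaps the memcpy operands and flushes each chunk (`nv_flush`), which is where an NVDIMM's byte-addressable persistence replaces the usual write-plus-fsync path.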
Attachment: v3-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch
From c9736171b0480c57ce8f457a3ce1a8ee29ce02f6 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:59 +0900
Subject: [PATCH v3 4/5] pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto the non-volatile
WAL buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option.
The path will be written to postgresql.auto.conf or recovery.conf.
The size of the new NVWAL is the same as the master's.
---
src/bin/pg_basebackup/pg_basebackup.c | 335 +++++++++++++++++++++++++-
src/bin/pg_basebackup/streamutil.c | 69 ++++++
src/bin/pg_basebackup/streamutil.h | 3 +
src/bin/pg_rewind/pg_rewind.c | 4 +-
src/fe_utils/recovery_gen.c | 9 +-
src/include/fe_utils/recovery_gen.h | 3 +-
6 files changed, 407 insertions(+), 16 deletions(-)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 4f29671d0c..e56fae7f47 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,9 @@
#ifdef HAVE_LIBZ
#include <zlib.h>
#endif
+#ifdef USE_NVWAL
+#include <libpmem.h>
+#endif
#include "access/xlog_internal.h"
#include "common/file_perm.h"
@@ -127,7 +130,8 @@ typedef enum
static char *basedir = NULL;
static TablespaceList tablespace_dirs = {NULL, NULL};
static char *xlog_dir = NULL;
-static char format = 'p'; /* p(lain)/t(ar) */
+static char format = 'p'; /* p(lain)/t(ar); 'p' even if 'nvwal' given */
+static bool format_nvwal = false; /* true if 'nvwal' given */
static char *label = "pg_basebackup base backup";
static bool noclean = false;
static bool checksum_failure = false;
@@ -150,14 +154,24 @@ static bool verify_checksums = true;
static bool manifest = true;
static bool manifest_force_encode = false;
static char *manifest_checksums = NULL;
+static char *nvwal_path = NULL;
+#ifdef USE_NVWAL
+static size_t nvwal_size = 0;
+static char *nvwal_pages = NULL;
+static size_t nvwal_mapped_len = 0;
+#endif
static bool success = false;
+static bool xlogdir_is_pg_xlog = false;
static bool made_new_pgdata = false;
static bool found_existing_pgdata = false;
static bool made_new_xlogdir = false;
static bool found_existing_xlogdir = false;
static bool made_tablespace_dirs = false;
static bool found_tablespace_dirs = false;
+#ifdef USE_NVWAL
+static bool made_new_nvwal = false;
+#endif
/* Progress counters */
static uint64 totalsize_kb;
@@ -381,7 +395,7 @@ usage(void)
printf(_(" %s [OPTION]...\n"), progname);
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
- printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -F, --format=p|t|n output format (plain (default), tar, nvwal)\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -389,6 +403,7 @@ usage(void)
printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"
" relocate tablespace in OLDDIR to NEWDIR\n"));
printf(_(" --waldir=WALDIR location for the write-ahead log directory\n"));
+ printf(_(" --nvwal-path=NVWAL location for the NVWAL file\n"));
printf(_(" -X, --wal-method=none|fetch|stream\n"
" include required WAL files with specified method\n"));
printf(_(" -z, --gzip compress tar output\n"));
@@ -629,9 +644,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
/* In post-10 cluster, pg_xlog has been renamed to pg_wal */
snprintf(param->xlog, sizeof(param->xlog), "%s/%s",
- basedir,
- PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
- "pg_xlog" : "pg_wal");
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
/* Temporary replication slots are only supported in 10 and newer */
if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_TEMP_SLOTS)
@@ -668,9 +681,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
* tar file may arrive later.
*/
snprintf(statusdir, sizeof(statusdir), "%s/%s/archive_status",
- basedir,
- PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
- "pg_xlog" : "pg_wal");
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
{
@@ -1787,6 +1798,135 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
appendPQExpBuffer(buf, copybuf, r);
}
+#ifdef USE_NVWAL
+static void
+cleanup_nvwal_atexit(void)
+{
+ if (success || in_log_streamer)
+ return;
+
+ if (nvwal_pages != NULL)
+ {
+ pg_log_info("unmapping NVWAL");
+ if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+ {
+ pg_log_error("could not unmap NVWAL: %m");
+ return;
+ }
+ }
+
+ if (nvwal_path != NULL && made_new_nvwal)
+ {
+ pg_log_info("removing NVWAL file \"%s\"", nvwal_path);
+ if (unlink(nvwal_path) < 0)
+ {
+ pg_log_error("could not remove NVWAL file \"%s\": %m", nvwal_path);
+ return;
+ }
+ }
+}
+
+static int
+filter_walseg(const struct dirent *d)
+{
+ char fullpath[MAXPGPATH];
+ struct stat statbuf;
+
+ if (!IsXLogFileName(d->d_name))
+ return 0;
+
+ snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", d->d_name);
+
+ if (stat(fullpath, &statbuf) < 0)
+ return 0;
+
+ if (!S_ISREG(statbuf.st_mode))
+ return 0;
+
+ if (statbuf.st_size != WalSegSz)
+ return 0;
+
+ return 1;
+}
+
+static int
+compare_walseg(const struct dirent **a, const struct dirent **b)
+{
+ return strcmp((*a)->d_name, (*b)->d_name);
+}
+
+static void
+free_namelist(struct dirent **namelist, int nr)
+{
+ for (int i = 0; i < nr; ++i)
+ free(namelist[i]);
+
+ free(namelist);
+}
+
+static bool
+copy_walseg_onto_nvwal(const char *segname)
+{
+ char fullpath[MAXPGPATH];
+ int fd;
+ size_t off;
+ struct stat statbuf;
+ TimeLineID tli;
+ XLogSegNo segno;
+
+ snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", segname);
+
+ fd = open(fullpath, O_RDONLY);
+ if (fd < 0)
+ {
+ pg_log_error("could not open xlog segment \"%s\": %m", fullpath);
+ return false;
+ }
+
+ if (fstat(fd, &statbuf) < 0)
+ {
+ pg_log_error("could not fstat xlog segment \"%s\": %m", fullpath);
+ goto close_on_error;
+ }
+
+ if (!S_ISREG(statbuf.st_mode))
+ {
+ pg_log_error("xlog segment \"%s\" is not a regular file", fullpath);
+ goto close_on_error;
+ }
+
+ if (statbuf.st_size != WalSegSz)
+ {
+ pg_log_error("invalid size of xlog segment \"%s\"; expected %d, actual %zd",
+ fullpath, WalSegSz, (ssize_t) statbuf.st_size);
+ goto close_on_error;
+ }
+
+ XLogFromFileName(segname, &tli, &segno, WalSegSz);
+ off = ((size_t) segno * WalSegSz) % nvwal_size;
+
+ if (read(fd, &nvwal_pages[off], WalSegSz) < WalSegSz)
+ {
+ pg_log_error("could not fully read xlog segment \"%s\": %m", fullpath);
+ goto close_on_error;
+ }
+
+ if (close(fd) < 0)
+ {
+ pg_log_error("could not close xlog segment \"%s\": %m", fullpath);
+ return false;
+ }
+
+ return true;
+
+close_on_error:
+ (void) close(fd);
+ return false;
+}
+#endif
+
static void
BaseBackup(void)
{
@@ -1845,7 +1985,8 @@ BaseBackup(void)
* Build contents of configuration file if requested
*/
if (writerecoveryconf)
- recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot);
+ recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot,
+ nvwal_path);
/*
* Run IDENTIFY_SYSTEM so we can get the timeline
@@ -2214,6 +2355,69 @@ BaseBackup(void)
exit(1);
}
+#ifdef USE_NVWAL
+ /* Copy xlog segments into NVWAL when in nvwal mode */
+ if (format_nvwal)
+ {
+ char xldr_path[MAXPGPATH];
+ int nr_segs;
+ struct dirent **namelist;
+
+ /* clear NVWAL before copying xlog segments */
+ pmem_memset_persist(nvwal_pages, 0, nvwal_size);
+
+ snprintf(xldr_path, sizeof(xldr_path), "%s/%s",
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
+
+ /*
+ * Sort xlog segments in ascending order, filtering out non-segment
+ * files and directories.
+ */
+ nr_segs = scandir(xldr_path, &namelist, filter_walseg, compare_walseg);
+ if (nr_segs < 0)
+ {
+ pg_log_error("could not scan xlog directory \"%s\": %m", xldr_path);
+ exit(1);
+ }
+
+ /* Copy xlog segments onto NVWAL */
+ for (int i = 0; i < nr_segs; ++i)
+ {
+ if (!copy_walseg_onto_nvwal(namelist[i]->d_name))
+ {
+ free_namelist(namelist, nr_segs);
+ exit(1);
+ }
+ }
+
+ /* Copy complete; now remove all the xlog segments */
+ for (int i = 0; i < nr_segs; ++i)
+ {
+ char fullpath[MAXPGPATH];
+
+ snprintf(fullpath, sizeof(fullpath), "%s/%s",
+ xldr_path, namelist[i]->d_name);
+
+ if (unlink(fullpath) < 0)
+ {
+ pg_log_error("could not remove xlog segment \"%s\": %m", fullpath);
+ free_namelist(namelist, nr_segs);
+ exit(1);
+ }
+ }
+
+ free_namelist(namelist, nr_segs);
+
+ if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+ {
+ pg_log_error("could not unmap NVWAL: %m");
+ exit(1);
+ }
+ nvwal_pages = NULL;
+ nvwal_mapped_len = 0;
+ }
+#endif
+
if (verbose)
pg_log_info("base backup completed");
}
@@ -2255,6 +2459,7 @@ main(int argc, char **argv)
{"no-manifest", no_argument, NULL, 5},
{"manifest-force-encode", no_argument, NULL, 6},
{"manifest-checksums", required_argument, NULL, 7},
+ {"nvwal-path", required_argument, NULL, 8},
{NULL, 0, NULL, 0}
};
int c;
@@ -2295,9 +2500,27 @@ main(int argc, char **argv)
break;
case 'F':
if (strcmp(optarg, "p") == 0 || strcmp(optarg, "plain") == 0)
+ {
+ /* See the comment for "nvwal" below */
format = 'p';
+ format_nvwal = false;
+ }
else if (strcmp(optarg, "t") == 0 || strcmp(optarg, "tar") == 0)
+ {
+ /* See the comment for "nvwal" below */
format = 't';
+ format_nvwal = false;
+ }
+ else if (strcmp(optarg, "n") == 0 || strcmp(optarg, "nvwal") == 0)
+ {
+ /*
+ * If "nvwal" mode is given, set the two variables as follows
+ * because it is almost the same as "plain" mode, except for
+ * NVWAL handling.
+ */
+ format = 'p';
+ format_nvwal = true;
+ }
else
{
pg_log_error("invalid output format \"%s\", must be \"plain\" or \"tar\"",
@@ -2352,6 +2575,9 @@ main(int argc, char **argv)
case 1:
xlog_dir = pg_strdup(optarg);
break;
+ case 8:
+ nvwal_path = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2533,7 +2759,7 @@ main(int argc, char **argv)
{
if (format != 'p')
{
- pg_log_error("WAL directory location can only be specified in plain mode");
+ pg_log_error("WAL directory location can only be specified in plain or nvwal mode");
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
progname);
exit(1);
@@ -2550,6 +2776,44 @@ main(int argc, char **argv)
}
}
+#ifdef USE_NVWAL
+ if (format_nvwal)
+ {
+ if (nvwal_path == NULL)
+ {
+ pg_log_error("NVWAL file location must be given in nvwal mode");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+ /* clean up NVWAL file name and check if it is absolute */
+ canonicalize_path(nvwal_path);
+ if (!is_absolute_path(nvwal_path))
+ {
+ pg_log_error("NVWAL file location must be an absolute path");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+ /* We do not map NVWAL file here because we do not know its size yet */
+ }
+ else if (nvwal_path != NULL)
+ {
+ pg_log_error("NVWAL file location can only be specified in nvwal mode");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+#else
+ if (format_nvwal || nvwal_path != NULL)
+ {
+ pg_log_error("this build does not support nvwal mode");
+ exit(1);
+ }
+#endif /* USE_NVWAL */
+
#ifndef HAVE_LIBZ
if (compresslevel != 0)
{
@@ -2594,6 +2858,9 @@ main(int argc, char **argv)
}
atexit(disconnect_atexit);
+ /* Remember the predicate for use after disconnection */
+ xlogdir_is_pg_xlog = (PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL);
+
/*
* Set umask so that directories/files are created with the same
* permissions as directories/files in the source data directory.
@@ -2620,6 +2887,16 @@ main(int argc, char **argv)
if (!RetrieveWalSegSize(conn))
exit(1);
+#ifdef USE_NVWAL
+ /* determine remote server's NVWAL size */
+ if (format_nvwal)
+ {
+ nvwal_size = RetrieveNvwalSize(conn);
+ if (nvwal_size == 0)
+ exit(1);
+ }
+#endif
+
/* Create pg_wal symlink, if required */
if (xlog_dir)
{
@@ -2632,8 +2909,7 @@ main(int argc, char **argv)
* renamed to pg_wal in post-10 clusters.
*/
linkloc = psprintf("%s/%s", basedir,
- PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
- "pg_xlog" : "pg_wal");
+ xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
#ifdef HAVE_SYMLINK
if (symlink(xlog_dir, linkloc) != 0)
@@ -2648,6 +2924,41 @@ main(int argc, char **argv)
free(linkloc);
}
+#ifdef USE_NVWAL
+ /* Create and map NVWAL file if required */
+ if (format_nvwal)
+ {
+ int is_pmem = 0;
+
+ nvwal_pages = pmem_map_file(nvwal_path, nvwal_size,
+ PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+ pg_file_create_mode,
+ &nvwal_mapped_len, &is_pmem);
+ if (nvwal_pages == NULL)
+ {
+ pg_log_error("could not map a new NVWAL file \"%s\": %m",
+ nvwal_path);
+ exit(1);
+ }
+
+ made_new_nvwal = true;
+ atexit(cleanup_nvwal_atexit);
+
+ if (!is_pmem)
+ {
+ pg_log_error("NVWAL file \"%s\" is not on PMEM", nvwal_path);
+ exit(1);
+ }
+
+ if (nvwal_size != nvwal_mapped_len)
+ {
+ pg_log_error("invalid size of NVWAL file \"%s\"; expected %zu, actual %zu",
+ nvwal_path, nvwal_size, nvwal_mapped_len);
+ exit(1);
+ }
+ }
+#endif
+
BaseBackup();
success = true;
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 410116492e..af2bb21e4c 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -397,6 +397,75 @@ RetrieveDataDirCreatePerm(PGconn *conn)
return true;
}
+#ifdef USE_NVWAL
+/*
+ * Returns nvwal_size in bytes if available, 0 otherwise.
+ * Note that it is the caller's responsibility to check whether the returned
+ * nvwal_size is really valid, that is, a multiple of the WAL segment size.
+ */
+size_t
+RetrieveNvwalSize(PGconn *conn)
+{
+ PGresult *res;
+ char unit[3];
+ int val;
+ size_t nvwal_size;
+
+ /* check connection existence */
+ Assert(conn != NULL);
+
+ /* fail if we do not have SHOW command */
+ if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_SHOW_CMD)
+ {
+ pg_log_error("SHOW command is not supported for retrieving nvwal_size");
+ return 0;
+ }
+
+ res = PQexec(conn, "SHOW nvwal_size");
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ {
+ pg_log_error("could not send replication command \"%s\": %s",
+ "SHOW nvwal_size", PQerrorMessage(conn));
+
+ PQclear(res);
+ return 0;
+ }
+ if (PQntuples(res) != 1 || PQnfields(res) < 1)
+ {
+ pg_log_error("could not fetch NVWAL size: got %d rows and %d fields, expected %d rows and %d or more fields",
+ PQntuples(res), PQnfields(res), 1, 1);
+
+ PQclear(res);
+ return 0;
+ }
+
+ /* fetch value and unit from the result */
+ if (sscanf(PQgetvalue(res, 0, 0), "%d%2s", &val, unit) != 2)
+ {
+ pg_log_error("NVWAL size could not be parsed");
+ PQclear(res);
+ return 0;
+ }
+
+ PQclear(res);
+
+ /* convert to bytes */
+ if (strcmp(unit, "MB") == 0)
+ nvwal_size = ((size_t) val) << 20;
+ else if (strcmp(unit, "GB") == 0)
+ nvwal_size = ((size_t) val) << 30;
+ else if (strcmp(unit, "TB") == 0)
+ nvwal_size = ((size_t) val) << 40;
+ else
+ {
+ pg_log_error("unsupported NVWAL unit");
+ return 0;
+ }
+
+ return nvwal_size;
+}
+#endif
+
/*
* Run IDENTIFY_SYSTEM through a given connection and give back to caller
* some result information if requested:
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 57448656e3..b4c2ab1a74 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -41,6 +41,9 @@ extern bool RunIdentifySystem(PGconn *conn, char **sysid,
XLogRecPtr *startpos,
char **db_name);
extern bool RetrieveWalSegSize(PGconn *conn);
+#ifdef USE_NVWAL
+extern size_t RetrieveNvwalSize(PGconn *conn);
+#endif
extern TimestampTz feGetCurrentTimestamp(void);
extern void feTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
long *secs, int *microsecs);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0015d3b461..578b37b588 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -360,7 +360,7 @@ main(int argc, char **argv)
pg_log_info("no rewind required");
if (writerecoveryconf && !dry_run)
WriteRecoveryConfig(conn, datadir_target,
- GenerateRecoveryConfig(conn, NULL));
+ GenerateRecoveryConfig(conn, NULL, NULL));
exit(0);
}
@@ -460,7 +460,7 @@ main(int argc, char **argv)
if (writerecoveryconf && !dry_run)
WriteRecoveryConfig(conn, datadir_target,
- GenerateRecoveryConfig(conn, NULL));
+ GenerateRecoveryConfig(conn, NULL, NULL));
pg_log_info("Done!");
diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c
index 46ca20e20b..1e08ec3fa8 100644
--- a/src/fe_utils/recovery_gen.c
+++ b/src/fe_utils/recovery_gen.c
@@ -20,7 +20,7 @@ static char *escape_quotes(const char *src);
* return it.
*/
PQExpBuffer
-GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
+GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot, char *nvwal_path)
{
PQconninfoOption *connOptions;
PQExpBufferData conninfo_buf;
@@ -95,6 +95,13 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
replication_slot);
}
+ if (nvwal_path)
+ {
+ escaped = escape_quotes(nvwal_path);
+ appendPQExpBuffer(contents, "nvwal_path = '%s'\n", escaped);
+ free(escaped);
+ }
+
if (PQExpBufferBroken(contents))
{
pg_log_error("out of memory");
diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h
index c8655cd294..061c59125b 100644
--- a/src/include/fe_utils/recovery_gen.h
+++ b/src/include/fe_utils/recovery_gen.h
@@ -21,7 +21,8 @@
#define MINIMUM_VERSION_FOR_RECOVERY_GUC 120000
extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn,
- char *pg_replication_slot);
+ char *pg_replication_slot,
+ char *nvwal_path);
extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
PQExpBuffer contents);
--
2.17.1
v3-0005-README-for-non-volatile-WAL-buffer.patch
From 5a5408159af48096d0d9a1e002e49756078b526f Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:08:00 +0900
Subject: [PATCH v3 5/5] README for non-volatile WAL buffer
---
README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)
create mode 100644 README.nvwal
diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By putting the WAL buffer pages on persistent memory (PMEM) [1],
+inserting WAL records into them directly, and eliminating I/O for WAL segment
+files, PostgreSQL achieves lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+ * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+ * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+ * Linux: 4.15 or later (tested on 5.2)
+ * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+ * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I believe it is good to install under your home directory with the --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+ $ ./configure --with-nvwal --prefix="$HOME/postgres"
+ $ make
+ $ make install
+ $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make an ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget the "-o dax"
+option on mount. For Intel DCPMM and ipmctl, please see [4].
+
+ $ ndctl list
+ [
+ {
+ "dev":"namespace1.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem1",
+ "numa_node":1
+ },
+ {
+ "dev":"namespace0.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+ ]
+
+ $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+ {
+ "dev":"namespace0.0",
+ "mode":"fsdax",
+ "map":"dev",
+ "size":"94.50 GiB (101.47 GB)",
+ "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+
+ $ ls -l /dev/pmem0
+ brw-rw---- 1 root disk 259, 3 Jan 6 17:06 /dev/pmem0
+
+ $ sudo mkfs.ext4 -q -F /dev/pmem0
+ $ sudo mkdir -p /mnt/pmem0
+ $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+ $ mount -l | grep ^/dev/pmem0
+ /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge page
+----------------------------
+Transparent huge pages are generally not suitable for database workloads, but
+they improve PMEM performance by reducing the overhead of page walks.
+
+ $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+ -rw-r--r-- 1 root root 4096 Dec 3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+ $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+ $ cat /sys/kernel/mm/transparent_hugepage/enabled
+ [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+ -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)
+ -Q, --nvwal-size=SIZE size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+ $ sudo mkdir -p /mnt/pmem0/pgsql
+ $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+ $ export PGDATA="$HOME/pgdata"
+ $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find there is no WAL segment file to be created in PGDATA/pg_wal
+directory. That is okay; your NVWAL file has the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+  segment size. The segment size is given with initdb --wal-segsize, and is
+  16MB by default.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+ which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+ above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+ exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+ not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please be careful to decide
+ how large your NVWAL file is to be.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters nvwal_path and nvwal_size, corresponding to the two
+new options of initdb. If you run initdb as above, you will find the following
+settings in postgresql.conf in your PGDATA directory:
+
+ max_wal_size = 80GB
+ min_wal_size = 80GB
+ nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+ nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+  actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+ forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres could possibly run even if the three values are not
+  the same; however, we have not tested such a case yet.
+
+
+Startup
+-------
+Start postgres as usual:
+
+ $ pg_ctl start
+
+or use numactl as follows to let postgres run on the specified NUMA node
+(typically the one on which your NVWAL file resides) if you need stable performance:
+
+ $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
--
2.17.1
Rebased.
On Wed, Jun 24, 2020 at 16:44 Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
Dear hackers,
I update my non-volatile WAL buffer's patchset to v3. Now we can use it
in streaming replication mode.

Updates from v2:
- walreceiver supports non-volatile WAL buffer
  Now walreceiver stores received records directly to non-volatile WAL
  buffer if applicable.
- pg_basebackup supports non-volatile WAL buffer
  Now pg_basebackup copies received WAL segments onto non-volatile WAL
  buffer if you run it with "nvwal" mode (-Fn).

You should specify a new NVWAL path with the --nvwal-path option. The path
will be written to postgresql.auto.conf or recovery.conf. The size of the
new NVWAL is the same as the master's.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote'
<amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A new v2
patchset is attached to this mail.
I also measured performance before and after patchset, varying
-c/--client and -j/--jobs options of pgbench, for
each scaling factor s = 50 or 1000. The results are presented in the
following tables and the attached charts.
Conditions, steps, and other details will be shown later.
Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)

Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)

Both throughput and average latency are improved for each scaling
factor. Throughput seemed to almost reach
the upper limit when (c,j)=(36,18).
The percentage in the s=1000 case looks larger than in the s=50 case. I think
a larger scaling factor leads to fewer contentions on the same tables and/or
indexes, that is, fewer lock and
unlock operations. In such a situation,
write-ahead logging appears to be more significant for performance.
Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with the "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times, then I found the median of the three as a final result shown
in the tables above.
(1) Run initdb with proper -D and -X options; and also give --nvwal-path
and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname
I gave no -b option to use the built-in "TPC-B (sort-of)" query.
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;
'PostgreSQL-development'
<pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,
Thank you for your advice. Exactly, it's so to speak "do as the
hackers do when in pgsql"...
I'm rebasing my branch onto master. I'll submit an updated patchset
and performance report later.
Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,
On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
Hello Amit,
I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as
REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?
Yes, because I think it's human-friendly to reproduce and discuss
performance measurement. Of course I know all new accepted patches are merged into master's HEAD, not stable
branches and not even release tags, so I'm aware of rebasing my
patchset onto master sooner or later. However, if someone,
including me, says that s/he applies my patchset to "master" and
measures its performance, we have to pay attention to which commit the "master"
really points to. Although we have sha1 hashes to specify which
commit, we should check whether the specific commit on master has
patches affecting performance or not because master's HEAD gets new patches day by day. On the other hand,
a release tag clearly points to a commit we all probably know. Also we
can more easily check the features and improvements by using release
notes and user manuals.
Thanks for clarifying. I see where you're coming from.
While I do sometimes see people reporting numbers with the latest
stable release' branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to notice
impact from any relevant developments in the master branch, even
developments which possibly require rethinking the architecture of your
own changes, although maybe that
rarely occurs.
Thanks,
Amit
--
Takashi Menjo <takashi.menjo@gmail.com>
Attachments:
v4-0001-Support-GUCs-for-external-WAL-buffer.patch
From 668939ff8ddca517c7efb08218b01007ee6b4e94 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:56 +0900
Subject: [PATCH v4 1/5] Support GUCs for external WAL buffer
To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size. Now postgres maps a file at that path onto memory to
use it as WAL buffer. Note that the buffer is still volatile for now.
---
configure | 262 ++++++++++++++++++
configure.ac | 43 +++
src/backend/access/transam/Makefile | 3 +-
src/backend/access/transam/nv_xlog_buffer.c | 95 +++++++
src/backend/access/transam/xlog.c | 164 ++++++++++-
src/backend/utils/misc/guc.c | 23 +-
src/backend/utils/misc/postgresql.conf.sample | 2 +
src/bin/initdb/initdb.c | 93 ++++++-
src/include/access/nv_xlog_buffer.h | 71 +++++
src/include/access/xlog.h | 2 +
src/include/pg_config.h.in | 6 +
src/include/utils/guc.h | 4 +
12 files changed, 747 insertions(+), 21 deletions(-)
create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
create mode 100644 src/include/access/nv_xlog_buffer.h
diff --git a/configure b/configure
index 19a3cd09a0..764ed1e942 100755
--- a/configure
+++ b/configure
@@ -867,6 +867,7 @@ with_libxml
with_libxslt
with_system_tzdata
with_zlib
+with_nvwal
with_gnu_ld
enable_largefile
'
@@ -1571,6 +1572,7 @@ Optional Packages:
--with-system-tzdata=DIR
use system time zone data in DIR
--without-zlib do not use Zlib
+ --with-nvwal use non-volatile WAL buffer (NVWAL)
--with-gnu-ld assume the C compiler uses GNU ld [default=no]
Some influential environment variables:
@@ -8601,6 +8603,203 @@ fi
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+ withval=$with_nvwal;
+ case $withval in
+ yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+ ;;
+ no)
+ :
+ ;;
+ *)
+ as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+ ;;
+ esac
+
+else
+ with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+ freebsd1*|freebsd2*) elf=no;;
+ freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ if test -z "$GREP"; then
+ ac_path_GREP_found=false
+ # Loop through the user's path and test for each of PROGNAME-LIST
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_prog in grep ggrep; do
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+ as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+ # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+ ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+ ac_count=0
+ $as_echo_n 0123456789 >"conftest.in"
+ while :
+ do
+ cat "conftest.in" "conftest.in" >"conftest.tmp"
+ mv "conftest.tmp" "conftest.in"
+ cp "conftest.in" "conftest.nl"
+ $as_echo 'GREP' >> "conftest.nl"
+ "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+ diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+ as_fn_arith $ac_count + 1 && ac_count=$as_val
+ if test $ac_count -gt ${ac_path_GREP_max-0}; then
+ # Best one so far, save it but keep looking for a better one
+ ac_cv_path_GREP="$ac_path_GREP"
+ ac_path_GREP_max=$ac_count
+ fi
+ # 10*(2^10) chars as input seems more than enough
+ test $ac_count -gt 10 && break
+ done
+ rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+ $ac_path_GREP_found && break 3
+ done
+ done
+ done
+IFS=$as_save_IFS
+ if test -z "$ac_cv_path_GREP"; then
+ as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+ fi
+else
+ ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+ then ac_cv_path_EGREP="$GREP -E"
+ else
+ if test -z "$EGREP"; then
+ ac_path_EGREP_found=false
+ # Loop through the user's path and test for each of PROGNAME-LIST
+ as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+ IFS=$as_save_IFS
+ test -z "$as_dir" && as_dir=.
+ for ac_prog in egrep; do
+ for ac_exec_ext in '' $ac_executable_extensions; do
+ ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+ as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+ # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+ ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+ ac_count=0
+ $as_echo_n 0123456789 >"conftest.in"
+ while :
+ do
+ cat "conftest.in" "conftest.in" >"conftest.tmp"
+ mv "conftest.tmp" "conftest.in"
+ cp "conftest.in" "conftest.nl"
+ $as_echo 'EGREP' >> "conftest.nl"
+ "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+ diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+ as_fn_arith $ac_count + 1 && ac_count=$as_val
+ if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+ # Best one so far, save it but keep looking for a better one
+ ac_cv_path_EGREP="$ac_path_EGREP"
+ ac_path_EGREP_max=$ac_count
+ fi
+ # 10*(2^10) chars as input seems more than enough
+ test $ac_count -gt 10 && break
+ done
+ rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+ $ac_path_EGREP_found && break 3
+ done
+ done
+ done
+IFS=$as_save_IFS
+ if test -z "$ac_cv_path_EGREP"; then
+ as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+ fi
+else
+ ac_cv_path_EGREP=$EGREP
+fi
+
+ fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+#if __ELF__
+ yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+ $EGREP "yes" >/dev/null 2>&1; then :
+ ELF_SYS=true
+else
+ if test "X$elf" = "Xyes" ; then
+ ELF_SYS=true
+else
+ ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
#
# Assignments
#
@@ -12962,6 +13161,57 @@ fi
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+ $as_echo_n "(cached) " >&6
+else
+ ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h. */
+
+/* Override any GCC internal prototype to avoid an error.
+ Use char because int might match the return type of a GCC
+ builtin and then its argument prototype would still apply. */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+ ;
+ return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+ ac_cv_lib_pmem_pmem_map_file=yes
+else
+ ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+ conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+ cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+ LIBS="-lpmem $LIBS"
+
+else
+ as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
##
## Header files
@@ -13641,6 +13891,18 @@ fi
done
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+ as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
fi
if test "$PORTNAME" = "win32" ; then
diff --git a/configure.ac b/configure.ac
index 6b9d0487a8..afa501a665 100644
--- a/configure.ac
+++ b/configure.ac
@@ -999,6 +999,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
[do not use Zlib])
AC_SUBST(with_zlib)
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+ [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+ freebsd1*|freebsd2*) elf=no;;
+ freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+ yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+ ELF_SYS=true
+else
+ ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
#
# Assignments
#
@@ -1303,6 +1335,12 @@ elif test "$with_uuid" = ossp ; then
fi
AC_SUBST(UUID_LIBS)
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+ AC_CHECK_LIB(pmem, pmem_map_file, [],
+ [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
##
## Header files
@@ -1480,6 +1518,11 @@ elif test "$with_uuid" = ossp ; then
[AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
fi
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+ AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
if test "$PORTNAME" = "win32" ; then
AC_CHECK_HEADERS(crtdefs.h)
fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
xlogfuncs.o \
xloginsert.o \
xlogreader.o \
- xlogutils.o
+ xlogutils.o \
+ nv_xlog_buffer.o
include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ * PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns the mapped address on success; PANICs and never returns otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+ void *addr;
+ size_t map_len = 0;
+ int is_pmem = 0;
+
+ Assert(fname != NULL);
+ Assert(fsize > 0);
+
+ if (IsBootstrapProcessingMode())
+ {
+ /*
+ * Create and map a new file if we are in bootstrap mode (typically
+ * executed by initdb).
+ */
+ addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+ pg_file_create_mode, &map_len, &is_pmem);
+ }
+ else
+ {
+ /*
+ * Map an existing file. The second argument (len) should be zero,
+ * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+ * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+ */
+ addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+ }
+
+ if (addr == NULL)
+ elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+ if (map_len != fsize)
+ elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+ "expected %zu; actual %zu",
+ fname, fsize, map_len);
+
+ if (!is_pmem)
+ elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+ fname);
+
+ /*
+	 * Assert XLOG page boundary alignment (8 KiB by default).  This should hold
+	 * because PMDK aligns large mappings to hugepage boundaries (2 MiB or 1 GiB on x64).
+ */
+ Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+ elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+ fname, addr, (char *) addr + map_len);
+ return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+ Assert(addr != NULL);
+
+ if (pmem_unmap(addr, fsize) < 0)
+ {
+ elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+ return;
+ }
+
+ elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..a7bb7c88ff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -37,6 +37,7 @@
#include "access/xloginsert.h"
#include "access/xlogreader.h"
#include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
#include "catalog/catversion.h"
#include "catalog/pg_control.h"
#include "catalog/pg_database.h"
@@ -873,6 +874,12 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
+/* For non-volatile WAL buffer (NVWAL) */
+char *NvwalPath = NULL; /* a GUC parameter */
+int NvwalSizeMB = 1024; /* a direct GUC parameter */
+static Size NvwalSize = 0; /* an indirect GUC parameter */
+static bool NvwalAvail = false;
+
/* For WALInsertLockAcquire/Release functions */
static int MyLockNo = 0;
static bool holdingAllLocks = false;
@@ -5014,6 +5021,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
return true;
}
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+ Assert(!NvwalAvail);
+
+ if (**newval != '\0')
+ {
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path is an invalid parameter without NVWAL support");
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+ /* true if not empty; false if empty */
+ NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the bounds only and DOES NOT check whether the size is a
+ * multiple of wal_segment_size, because the segment size (probably stored
+ * in the control file) has not been set properly yet at this point.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+ Size buf_size;
+ int64 npages;
+
+ Assert(*newval > 0);
+
+ buf_size = (Size) (*newval) * 1024 * 1024;
+ npages = (int64) buf_size / XLOG_BLCKSZ;
+ Assert(npages > 0);
+
+ if (npages > INT_MAX)
+ {
+ /* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+ GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+ GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages is too large; "
+ "buf_size %zu; XLOG_BLCKSZ %d",
+ *newval, buf_size, (int) XLOG_BLCKSZ);
+ return false;
+ }
+#endif
+
+ return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+ NvwalSize = (Size) newval * 1024 * 1024;
+}
+
/*
* Read the control file, set respective GUCs.
*
@@ -5042,13 +5119,49 @@ XLOGShmemSize(void)
{
Size size;
+ /*
+	 * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+	 * Instead, we set it to a value derived from the size of the buffer file.
+	 * This must be done here because the xlblocks array size depends on it.
+ */
+ if (NvwalAvail)
+ {
+ char buf[32];
+ int64 npages;
+
+ Assert(NvwalSizeMB > 0);
+ Assert(NvwalSize > 0);
+ Assert(wal_segment_size > 0);
+ Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+ /*
+		 * Only now can we check whether the size of the non-volatile WAL
+		 * buffer (nvwal_size) is a multiple of the WAL segment size.
+ *
+ * Note that NvwalSize has already been calculated in assign_nvwal_size.
+ */
+ if (NvwalSize % wal_segment_size != 0)
+ {
+ elog(PANIC,
+ "invalid value for nvwal_size (%dMB): "
+				 "it must be a multiple of the WAL segment size; "
+ "NvwalSize %zu; wal_segment_size %d",
+ NvwalSizeMB, NvwalSize, wal_segment_size);
+ }
+
+ npages = (int64) NvwalSize / XLOG_BLCKSZ;
+ Assert(npages > 0 && npages <= INT_MAX);
+
+ snprintf(buf, sizeof(buf), "%d", (int) npages);
+ SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+ }
/*
* If the value of wal_buffers is -1, use the preferred auto-tune value.
* This isn't an amazingly clean place to do this, but we must wait till
* NBuffers has received its final value, and must do it before using the
* value of XLOGbuffers to do anything important.
*/
- if (XLOGbuffers == -1)
+ else if (XLOGbuffers == -1)
{
char buf[32];
@@ -5064,10 +5177,13 @@ XLOGShmemSize(void)
size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
- /* extra alignment padding for XLOG I/O buffers */
- size = add_size(size, XLOG_BLCKSZ);
- /* and the buffers themselves */
- size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ if (!NvwalAvail)
+ {
+ /* extra alignment padding for XLOG I/O buffers */
+ size = add_size(size, XLOG_BLCKSZ);
+ /* and the buffers themselves */
+ size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+ }
/*
* Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5161,13 +5277,32 @@ XLOGShmemInit(void)
}
/*
- * Align the start of the page buffers to a full xlog block size boundary.
- * This simplifies some calculations in XLOG insertion. It is also
- * required for O_DIRECT.
+	 * Open and memory-map the file for the non-volatile XLOG buffer.  PMDK
+	 * will align the start of the buffer to a 2 MiB boundary if the size
+	 * of the buffer is at least 4 MiB.
*/
- allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
- XLogCtl->pages = allocptr;
- memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ if (NvwalAvail)
+ {
+		/* Logging and error handling are done inside the function */
+ XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+ /*
+		 * Do not memset the non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it may still contain records needed for recovery.  We do
+		 * so at checkpoint time, after recovery completes successfully.
+ */
+ }
+ else
+ {
+ /*
+ * Align the start of the page buffers to a full xlog block size
+ * boundary. This simplifies some calculations in XLOG insertion. It
+ * is also required for O_DIRECT.
+ */
+ allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+ XLogCtl->pages = allocptr;
+ memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+ }
/*
* Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8523,6 +8658,13 @@ ShutdownXLOG(int code, Datum arg)
CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
}
+
+ /*
+ * If we use non-volatile XLOG buffer, unmap it.
+ */
+ if (NvwalAvail)
+ UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
ShutdownCLOG();
ShutdownCommitTs();
ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..77a1b8bb32 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2714,7 +2714,7 @@ static struct config_int ConfigureNamesInt[] =
GUC_UNIT_XBLOCKS
},
&XLOGbuffers,
- -1, -1, (INT_MAX / XLOG_BLCKSZ),
+ -1, -1, INT_MAX,
check_wal_buffers, NULL, NULL
},
@@ -3399,6 +3399,17 @@ static struct config_int ConfigureNamesInt[] =
check_huge_page_size, NULL, NULL
},
+ {
+ {"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+ NULL,
+ GUC_UNIT_MB
+ },
+ &NvwalSizeMB,
+ 1024, 1, INT_MAX,
+ check_nvwal_size, assign_nvwal_size, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4448,6 +4459,16 @@ static struct config_string ConfigureNamesString[] =
check_backtrace_functions, assign_backtrace_functions, NULL
},
+ {
+ {"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+ NULL
+ },
+ &NvwalPath,
+ "",
+ check_nvwal_path, assign_nvwal_path, NULL
+ },
+
/* End-of-list marker */
{
{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..f343d6b296 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,8 @@
#checkpoint_timeout = 5min # range 30s-1d
#max_wal_size = 1GB
#min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
#checkpoint_completion_target = 0.5 # checkpoint target duration, 0.0 - 1.0
#checkpoint_flush_after = 0 # measured in pages, 0 disables
#checkpoint_warning = 30s # 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 37e0d7ceab..2dd0a09734 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -145,7 +145,10 @@ static bool show_setting = false;
static bool data_checksums = false;
static char *xlog_dir = NULL;
static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
static int wal_segment_size_mb;
+static int nvwal_size_mb;
/* internal vars */
@@ -1098,14 +1101,78 @@ setup_config(void)
conflines = replace_token(conflines, "#port = 5432", repltok);
#endif
- /* set default max_wal_size and min_wal_size */
- snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
- pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
- conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+ if (nvwal_path != NULL)
+ {
+ int nr_segs;
+
+ if (str_nvwal_size_mb == NULL)
+ nvwal_size_mb = 1024;
+ else
+ {
+ char *endptr;
+
+ /* check that the argument is a number */
+ nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+ /* verify that the size of non-volatile WAL buffer is valid */
+ if (endptr == str_nvwal_size_mb || *endptr != '\0')
+ {
+ pg_log_error("argument of --nvwal-size must be a number; "
+ "str_nvwal_size_mb '%s'",
+ str_nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb <= 0)
+ {
+ pg_log_error("argument of --nvwal-size must be a positive number; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb);
+ exit(1);
+ }
+ if (nvwal_size_mb % wal_segment_size_mb != 0)
+ {
+ pg_log_error("argument of --nvwal-size must be multiple of WAL segment size; "
+ "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+ str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+ exit(1);
+ }
+ }
+
+ /*
+	 * XXX We set {min_,max_,nv}wal_size to the same value.  Note that
+	 * postgres might still bootstrap and run if the three settings do not
+	 * have the same value, but that has not been tested yet.
+ */
+ nr_segs = nvwal_size_mb / wal_segment_size_mb;
- snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
- pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
- conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+ nvwal_path);
+ conflines = replace_token(conflines,
+ "#nvwal_path = '/path/to/nvwal'", repltok);
+
+ snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+ pretty_wal_size(nr_segs));
+ conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+ }
+ else
+ {
+ /* set default max_wal_size and min_wal_size */
+ snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+ pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+ conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+ snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+ pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+ conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+ }
snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
escape_quotes(lc_messages));
@@ -2310,6 +2377,8 @@ usage(const char *progname)
printf(_(" -W, --pwprompt prompt for a password for the new superuser\n"));
printf(_(" -X, --waldir=WALDIR location for the write-ahead log directory\n"));
printf(_(" --wal-segsize=SIZE size of WAL segments, in megabytes\n"));
+ printf(_(" -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)\n"));
+ printf(_(" -Q, --nvwal-size=SIZE size of NVWAL, in megabytes\n"));
printf(_("\nLess commonly used options:\n"));
printf(_(" -d, --debug generate lots of debugging output\n"));
printf(_(" -k, --data-checksums use data page checksums\n"));
@@ -2978,6 +3047,8 @@ main(int argc, char *argv[])
{"sync-only", no_argument, NULL, 'S'},
{"waldir", required_argument, NULL, 'X'},
{"wal-segsize", required_argument, NULL, 12},
+ {"nvwal-path", required_argument, NULL, 'P'},
+ {"nvwal-size", required_argument, NULL, 'Q'},
{"data-checksums", no_argument, NULL, 'k'},
{"allow-group-access", no_argument, NULL, 'g'},
{NULL, 0, NULL, 0}
@@ -3021,7 +3092,7 @@ main(int argc, char *argv[])
/* process command-line options */
- while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+ while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
{
switch (c)
{
@@ -3115,6 +3186,12 @@ main(int argc, char *argv[])
case 12:
str_wal_segment_size_mb = pg_strdup(optarg);
break;
+ case 'P':
+ nvwal_path = pg_strdup(optarg);
+ break;
+ case 'Q':
+ str_nvwal_size_mb = pg_strdup(optarg);
+ break;
case 'g':
SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#define nv_memset_persist pmem_memset_persist
+#define nv_memcpy_nodrain pmem_memcpy_nodrain
+#define nv_flush pmem_flush
+#define nv_drain pmem_drain
+#define nv_persist pmem_persist
+
+#else
+/* Fallback stubs; static inline avoids multiple definitions across files. */
+static inline void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+static inline void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+ return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+ size_t len)
+{
+ return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+ return;
+}
+
+static inline void
+nv_drain(void)
+{
+ return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+ return;
+}
+
+#endif /* USE_NVWAL */
+#endif /* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..03fd1267e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,8 @@ extern int recovery_min_apply_delay;
extern char *PrimaryConnInfo;
extern char *PrimarySlotName;
extern bool wal_receiver_create_temp_slot;
+extern char *NvwalPath;
+extern int NvwalSizeMB;
/* indirectly set via GUC system */
extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index fb270df678..961be9aff5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
/* Define to 1 if you have the `pam' library (-lpam). */
#undef HAVE_LIBPAM
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
/* Define if you have a function readline library */
#undef HAVE_LIBREADLINE
@@ -884,6 +887,9 @@
/* Define to select named POSIX semaphores. */
#undef USE_NAMED_POSIX_SEMAPHORES
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
/* Define to build with OpenSSL support. (--with-openssl) */
#undef USE_OPENSSL
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..d941a76d43 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,6 +438,10 @@ extern void assign_search_path(const char *newval, void *extra);
/* in access/transam/xlog.c */
extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
extern void assign_xlog_sync_method(int new_sync_method, void *extra);
#endif /* GUC_H */
--
2.17.1
Attachment: v4-0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 9d2ebe6744b9fdb966da78d4a535bc5c4fee33e0 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:57 +0900
Subject: [PATCH v4 2/5] Non-volatile WAL buffer
The external WAL buffer now becomes non-volatile.
Bumps PG_CONTROL_VERSION.
---
src/backend/access/transam/xlog.c | 1154 ++++++++++++++++--
src/backend/access/transam/xlogreader.c | 24 +
src/bin/pg_controldata/pg_controldata.c | 3 +
src/include/access/xlog.h | 8 +
src/include/catalog/pg_control.h | 17 +-
src/test/regress/expected/misc_functions.out | 14 +-
src/test/regress/sql/misc_functions.sql | 14 +-
7 files changed, 1097 insertions(+), 137 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7bb7c88ff..6a579a308f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -652,6 +652,13 @@ typedef struct XLogCtlData
TimeLineID ThisTimeLineID;
TimeLineID PrevTimeLineID;
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * All the records up to this LSN are persistent in NVWAL.
+ */
+ XLogRecPtr persistentUpTo;
+
/*
* SharedRecoveryState indicates if we're still in crash or archive
* recovery. Protected by info_lck.
@@ -783,11 +790,13 @@ typedef enum
XLOG_FROM_ANY = 0, /* request to read WAL from any source */
XLOG_FROM_ARCHIVE, /* restored using restore_command */
XLOG_FROM_PG_WAL, /* existing file in pg_wal */
- XLOG_FROM_STREAM /* streamed from primary */
+ XLOG_FROM_NVWAL, /* non-volatile WAL buffer */
+ XLOG_FROM_STREAM, /* streamed from primary via segment file */
+ XLOG_FROM_STREAM_NVWAL /* same as above, but via NVWAL */
} XLogSource;
/* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream", "stream_nvwal"};
/*
* openLogFile is -1 or a kernel FD for an open log file segment.
@@ -922,6 +931,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
bool fetching_ckpt, XLogRecPtr tliRecPtr);
static int emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
static void PreallocXlogFiles(XLogRecPtr endptr);
static void RemoveTempXlogFiles(void);
static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1204,6 +1214,43 @@ XLogInsertRecord(XLogRecData *rdata,
}
}
+ /*
+ * Request a checkpoint here if non-volatile WAL buffer is used and we
+ * have consumed too much WAL since the last checkpoint.
+ *
+	 * We first screen on condition (1) OR (2) below:
+	 *
+	 * (1) The record was the first one in some segment.
+	 * (2) The record was inserted across a segment boundary.
+	 *
+	 * We then check the number of the segment that the record was inserted into.
+ */
+ if (NvwalAvail && inserted &&
+ (StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+ StartPos / wal_segment_size < EndPos / wal_segment_size))
+ {
+ XLogSegNo end_segno;
+
+ XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+ /*
+		 * NOTE: We do not signal walsender here because the inserted record
+		 * has not been drained from the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal walarchiver here because the inserted record
+		 * has not been flushed to a segment file yet, so we don't need to update
+		 * XLogCtl->lastSegSwitch{Time,LSN}, which is used only by CheckArchiveTimeout.
+ */
+
+ /* Two-step checking for speed (see also XLogWrite) */
+ if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+ {
+ (void) GetRedoRecPtr();
+ if (XLogCheckpointNeeded(end_segno))
+ RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+ }
+ }
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG)
{
@@ -2136,6 +2183,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
XLogRecPtr NewPageBeginPtr;
XLogPageHeader NewPage;
int npages = 0;
+ bool is_firstpage;
+
+ if (NvwalAvail)
+ elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo,
+ (uint32) (upto >> 32),
+ (uint32) upto,
+ opportunistic ? "true" : "false");
LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
@@ -2197,7 +2253,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
{
/* Have to write it ourselves */
TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
- WriteRqst.Write = OldPageRqstPtr;
+
+ if (NvwalAvail)
+ {
+ /*
+ * If we use non-volatile WAL buffer, writing the buffer pages
+ * out to segment files is a special but expected case, and for
+ * simplicity it is done segment by segment.
+ */
+ XLogRecPtr OldSegEndPtr;
+
+ OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+ Assert(OldSegEndPtr % wal_segment_size == 0);
+
+ WriteRqst.Write = OldSegEndPtr;
+ }
+ else
+ WriteRqst.Write = OldPageRqstPtr;
+
WriteRqst.Flush = 0;
XLogWrite(WriteRqst, false);
LWLockRelease(WALWriteLock);
@@ -2224,7 +2298,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* Be sure to re-zero the buffer so that bytes beyond what we've
* written will look like zeroes and not valid XLOG records...
*/
- MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+ if (NvwalAvail)
+ {
+ /*
+ * We do not combine MemSet() and pmem_persist(), because
+ * pmem_persist() may fall back to a slow, strongly-ordered cache
+ * flush instruction if a fast weakly-ordered one is not supported.
+ * Instead, we first zero-fill the buffer with
+ * pmem_memset_persist(), which can leverage fast non-temporal
+ * store instructions, and make the header persistent later.
+ */
+ nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+ }
+ else
+ MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
/*
* Fill the new page's header
@@ -2256,7 +2343,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
/*
* If first page of an XLOG segment file, make it a long header.
*/
- if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+ is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+ if (is_firstpage)
{
XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
@@ -2271,7 +2359,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
* before the xlblocks update. GetXLogBuffer() reads xlblocks without
* holding a lock.
*/
- pg_write_barrier();
+ if (NvwalAvail)
+ {
+ /* Make the header persistent on PMEM */
+ nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+ }
+ else
+ pg_write_barrier();
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
@@ -2281,6 +2375,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
}
LWLockRelease(WALBufMappingLock);
+ if (NvwalAvail)
+ elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo,
+ (uint32) (XLogCtl->InitializedUpTo >> 32),
+ (uint32) XLogCtl->InitializedUpTo);
+
#ifdef WAL_DEBUG
if (XLOG_DEBUG && npages > 0)
{
@@ -2662,6 +2763,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
LogwrtResult.Flush = LogwrtResult.Write;
}
+ /*
+ * Update discardedUpTo if NVWAL is used. A new value should not fall
+ * behind the old one.
+ */
+ if (NvwalAvail)
+ {
+ Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ if (ControlFile->discardedUpTo < LogwrtResult.Write)
+ {
+ ControlFile->discardedUpTo = LogwrtResult.Write;
+ UpdateControlFile();
+ }
+ LWLockRelease(ControlFileLock);
+ }
+
/*
* Update shared-memory status
*
@@ -2866,6 +2984,123 @@ XLogFlush(XLogRecPtr record)
return;
}
+ if (NvwalAvail)
+ {
+ XLogRecPtr FromPos;
+
+ /*
+ * No page on the NVWAL needs to be flushed to segment files.
+ * Instead, we wait until all the insertions preceding this one
+ * complete; below, we wait for all those records to become
+ * persistent on the NVWAL.
+ */
+ record = WaitXLogInsertionsToFinish(record);
+
+ /*
+ * Check if another backend has already done what we are about to do.
+ *
+ * We can compare something <= XLogCtl->persistentUpTo without
+ * holding XLogCtl->info_lck spinlock because persistentUpTo is
+ * monotonically increasing and can be loaded atomically on each
+ * NVWAL-supported platform (now x64 only).
+ */
+ FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+ if (record <= FromPos)
+ return;
+
+ /*
+ * In a very rare case, we have wrapped around the whole NVWAL. We
+ * need not care about old pages here because they have already
+ * been evicted to segment files at record insertion.
+ *
+ * In such a case, we flush the whole NVWAL. We also log a warning
+ * because it can be a time-consuming operation.
+ *
+ * TODO: Advance XLogCtl->persistentUpTo at the end of XLogWrite so
+ * that the following first if-block can be removed.
+ */
+ if (record - FromPos > NvwalSize)
+ {
+ elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+ (uint32) (FromPos >> 32), (uint32) FromPos,
+ (uint32) (record >> 32), (uint32) record);
+
+ nv_flush(XLogCtl->pages, NvwalSize);
+ }
+ else
+ {
+ char *frompos;
+ char *uptopos;
+ size_t fromoff;
+ size_t uptooff;
+
+ /*
+ * Flush each record that is probably not flushed yet.
+ *
+ * We say "probably" for two reasons. First, a record copied with
+ * non-temporal store instructions has effectively been flushed
+ * already, but we cannot distinguish that case; nv_flush is
+ * harmless to consistency there.
+ *
+ * Second, the target record might have already been evicted to a
+ * segment file by now. In that case, too, nv_flush is harmless
+ * to consistency.
+ */
+ uptooff = record % NvwalSize;
+ uptopos = XLogCtl->pages + uptooff;
+ fromoff = FromPos % NvwalSize;
+ frompos = XLogCtl->pages + fromoff;
+
+ /* Handles rotation */
+ if (uptopos <= frompos)
+ {
+ nv_flush(frompos, NvwalSize - fromoff);
+ fromoff = 0;
+ frompos = XLogCtl->pages;
+ }
+
+ nv_flush(frompos, uptooff - fromoff);
+ }
+
+ /*
+ * To guarantee durability (the "D" of ACID), we must satisfy the
+ * following two conditions for each transaction X:
+ *
+ * (1) All the WAL records inserted by X, including the commit record
+ * of X, must be persistent on the NVWAL before the server
+ * commits X.
+ *
+ * (2) All the WAL records inserted by transactions other than X
+ * that have a smaller LSN than the commit record just inserted
+ * by X must be persistent on the NVWAL before the server
+ * commits X.
+ *
+ * Condition (1) is satisfied by a store barrier after the commit
+ * record of X is flushed, because each WAL record of X is already
+ * flushed at the end of its insertion. Condition (2) is satisfied
+ * by waiting for any record insertions with a smaller LSN than the
+ * commit record just inserted by X, followed by the same store
+ * barrier.
+ *
+ * Now is the time. Issue a store barrier.
+ */
+ nv_drain();
+
+ /*
+ * Remember where the last persistent record is. A new value should
+ * not fall behind the old one.
+ */
+ SpinLockAcquire(&XLogCtl->info_lck);
+ if (XLogCtl->persistentUpTo < record)
+ XLogCtl->persistentUpTo = record;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ /*
+ * The records up to the returned "record" are now persistent on
+ * the NVWAL. Signal the walsenders.
+ */
+ WalSndWakeupRequest();
+ WalSndWakeupProcessRequests();
+
+ return;
+ }
+
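The wraparound handling in the flush path above can be illustrated with a standalone sketch. This is not the patch's code: `fake_nv_flush` is a hypothetical stand-in for `nv_flush` that only counts bytes, and the buffer size is made small for the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flush hook standing in for nv_flush(); records how many
 * bytes were requested so the splitting logic can be checked. */
static size_t flushed_bytes;

static void
fake_nv_flush(const char *addr, size_t len)
{
	(void) addr;
	flushed_bytes += len;
}

/*
 * Flush the LSN range [from, upto) of a WAL stream mapped onto a ring
 * buffer of nvwal_size bytes starting at base. When the range wraps
 * around the end of the buffer, split the flush into two calls, as the
 * patch does in XLogFlush.
 */
static void
flush_ring_range(char *base, uint64_t nvwal_size,
				 uint64_t from, uint64_t upto)
{
	size_t		fromoff = from % nvwal_size;
	size_t		uptooff = upto % nvwal_size;

	if (uptooff <= fromoff)		/* range wraps around the buffer end */
	{
		fake_nv_flush(base + fromoff, nvwal_size - fromoff);
		fromoff = 0;
	}
	fake_nv_flush(base + fromoff, uptooff - fromoff);
}
```

In both the straight and the wrapped case, the total number of flushed bytes equals `upto - from` (the case where the range exceeds the whole buffer is handled separately in the patch, before this logic runs).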
/* Quick exit if already known flushed */
if (record <= LogwrtResult.Flush)
return;
@@ -3049,6 +3284,13 @@ XLogBackgroundFlush(void)
if (RecoveryInProgress())
return false;
+ /*
+ * Quick exit if the NVWAL buffer is used and archiving is not active.
+ * In this case, we need no WAL segment files in the pg_wal directory.
+ */
+ if (NvwalAvail && !XLogArchivingActive())
+ return false;
+
/* read LogwrtResult and update local state */
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
@@ -3067,6 +3309,18 @@ XLogBackgroundFlush(void)
flexible = false; /* ensure it all gets written */
}
+ /*
+ * If NVWAL is used, back off to the last completed segment boundary
+ * so that buffer pages are written to files segment by segment. We
+ * do this here, and only here, after XLogCtl->asyncXactLSN has been
+ * loaded, because that value must be taken into account.
+ */
+ if (NvwalAvail)
+ {
+ WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+ flexible = false; /* ensure it all gets written */
+ }
+
/*
* If already known flushed, we're done. Just need to check if we are
* holding an open file handle to a logfile that's no longer in use,
@@ -3093,7 +3347,12 @@ XLogBackgroundFlush(void)
flushbytes =
WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
- if (WalWriterFlushAfter == 0 || lastflush == 0)
+ if (NvwalAvail)
+ {
+ WriteRqst.Flush = WriteRqst.Write;
+ lastflush = now;
+ }
+ else if (WalWriterFlushAfter == 0 || lastflush == 0)
{
/* first call, or block based limits disabled */
WriteRqst.Flush = WriteRqst.Write;
@@ -3152,7 +3411,28 @@ XLogBackgroundFlush(void)
* Great, done. To take some work off the critical path, try to initialize
* as many of the no-longer-needed WAL buffers for future use as we can.
*/
- AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+ if (NvwalAvail && max_wal_senders == 0)
+ {
+ XLogRecPtr upto;
+
+ /*
+ * If NVWAL is used and there is no walsender, nobody will load
+ * records from the buffer. So let's recycle buffer pages up to {where
+ * we have requested to write and flush} + NvwalSize.
+ *
+ * Note that if NVWAL is used and a walsender appears to be running,
+ * we must do nothing: keep the written pages in the buffer so that
+ * walsenders load them from the buffer, not from the segment files.
+ * The buffer pages will eventually be recycled by a checkpoint.
+ */
+ Assert(WriteRqst.Write == WriteRqst.Flush);
+ Assert(WriteRqst.Write % wal_segment_size == 0);
+
+ upto = WriteRqst.Write + NvwalSize;
+ AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+ }
+ else
+ AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
/*
* If we determined that we need to write data, but somebody else
@@ -3885,6 +4165,43 @@ XLogFileClose(void)
ReleaseExternalFD();
}
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes the buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+ XLogRecPtr newupto,
+ InitializedUpTo;
+
+ Assert(NvwalAvail);
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ newupto = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ InitializedUpTo = XLogCtl->InitializedUpTo;
+
+ newupto += NvwalSize;
+ Assert(newupto % wal_segment_size == 0);
+
+ if (newupto <= InitializedUpTo)
+ return;
+
+ /*
+ * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+ * handles the first argument as the beginning of pages, not the end.
+ */
+ AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
/*
* Preallocate log files beyond the specified log endpoint.
*
@@ -4181,8 +4498,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
* Before deleting the file, see if it can be recycled as a future log
* segment. Only recycle normal files, pg_standby for example can create
* symbolic links pointing to a separate archive directory.
+ *
+ * If the NVWAL buffer is used, a log segment file is never recycled
+ * (that is, we always take the else branch).
*/
- if (wal_recycle &&
+ if (!NvwalAvail && wal_recycle &&
endlogSegNo <= recycleSegNo &&
lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
InstallXLogFileSegment(&endlogSegNo, path,
@@ -4600,6 +4920,7 @@ InitControlFile(uint64 sysidentifier)
memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
ControlFile->state = DB_SHUTDOWNED;
ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+ ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
/* Set important parameter values for use when replaying WAL */
ControlFile->MaxConnections = MaxConnections;
@@ -5430,41 +5751,58 @@ BootStrapXLOG(void)
record->xl_crc = crc;
/* Create first XLOG segment file */
- use_existent = false;
- openLogFile = XLogFileInit(1, &use_existent, false);
+ if (NvwalAvail)
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+ pgstat_report_wait_end();
- /*
- * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
- * close the file again in a moment.
- */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ nv_drain();
+ pgstat_report_wait_end();
- /* Write the first page with the initial record */
- errno = 0;
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
- if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
- {
- /* if write didn't set errno, assume problem is no disk space */
- if (errno == 0)
- errno = ENOSPC;
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not write bootstrap write-ahead log file: %m")));
+ /*
+ * The rest of the WAL state will be initialized in the startup
+ * process.
+ */
}
- pgstat_report_wait_end();
+ else
+ {
+ use_existent = false;
+ openLogFile = XLogFileInit(1, &use_existent, false);
- pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
- if (pg_fsync(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not fsync bootstrap write-ahead log file: %m")));
- pgstat_report_wait_end();
+ /*
+ * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+ * close the file again in a moment.
+ */
- if (close(openLogFile) != 0)
- ereport(PANIC,
- (errcode_for_file_access(),
- errmsg("could not close bootstrap write-ahead log file: %m")));
+ /* Write the first page with the initial record */
+ errno = 0;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+ if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+ {
+ /* if write didn't set errno, assume problem is no disk space */
+ if (errno == 0)
+ errno = ENOSPC;
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not write bootstrap write-ahead log file: %m")));
+ }
+ pgstat_report_wait_end();
- openLogFile = -1;
+ pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+ if (pg_fsync(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not fsync bootstrap write-ahead log file: %m")));
+ pgstat_report_wait_end();
+
+ if (close(openLogFile) != 0)
+ ereport(PANIC,
+ (errcode_for_file_access(),
+ errmsg("could not close bootstrap write-ahead log file: %m")));
+
+ openLogFile = -1;
+ }
/* Now create pg_control */
InitControlFile(sysidentifier);
@@ -5718,41 +6056,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
* happens in the middle of a segment, copy data from the last WAL segment
* of the old timeline up to the switch point, to the starting WAL segment
* on the new timeline.
+ *
+ * If non-volatile WAL buffer is used, no new segment file is created. Data
+ * up to the switch point will be copied into NVWAL buffer by StartupXLOG().
*/
- if (endLogSegNo == startLogSegNo)
- {
- /*
- * Make a copy of the file on the new timeline.
- *
- * Writing WAL isn't allowed yet, so there are no locking
- * considerations. But we should be just as tense as XLogFileInit to
- * avoid emplacing a bogus file.
- */
- XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
- XLogSegmentOffset(endOfLog, wal_segment_size));
- }
- else
+ if (!NvwalAvail)
{
- /*
- * The switch happened at a segment boundary, so just create the next
- * segment on the new timeline.
- */
- bool use_existent = true;
- int fd;
+ if (endLogSegNo == startLogSegNo)
+ {
+ /*
+ * Make a copy of the file on the new timeline.
+ *
+ * Writing WAL isn't allowed yet, so there are no locking
+ * considerations. But we should be just as tense as XLogFileInit to
+ * avoid emplacing a bogus file.
+ */
+ XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+ XLogSegmentOffset(endOfLog, wal_segment_size));
+ }
+ else
+ {
+ /*
+ * The switch happened at a segment boundary, so just create the next
+ * segment on the new timeline.
+ */
+ bool use_existent = true;
+ int fd;
- fd = XLogFileInit(startLogSegNo, &use_existent, true);
+ fd = XLogFileInit(startLogSegNo, &use_existent, true);
- if (close(fd) != 0)
- {
- char xlogfname[MAXFNAMELEN];
- int save_errno = errno;
+ if (close(fd) != 0)
+ {
+ char xlogfname[MAXFNAMELEN];
+ int save_errno = errno;
- XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
- wal_segment_size);
- errno = save_errno;
- ereport(ERROR,
- (errcode_for_file_access(),
- errmsg("could not close file \"%s\": %m", xlogfname)));
+ XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+ wal_segment_size);
+ errno = save_errno;
+ ereport(ERROR,
+ (errcode_for_file_access(),
+ errmsg("could not close file \"%s\": %m", xlogfname)));
+ }
}
}
@@ -7009,6 +7353,11 @@ StartupXLOG(void)
InRecovery = true;
}
+ /* Dump discardedUpTo just before REDO */
+ elog(LOG, "ControlFile->discardedUpTo %X/%X",
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
+
/* REDO */
if (InRecovery)
{
@@ -7795,10 +8144,88 @@ StartupXLOG(void)
Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
+ if (NvwalAvail)
+ {
+ XLogRecPtr discardedUpTo;
+
+ discardedUpTo = ControlFile->discardedUpTo;
+ Assert(discardedUpTo == InvalidXLogRecPtr ||
+ discardedUpTo % wal_segment_size == 0);
+
+ if (discardedUpTo == InvalidXLogRecPtr)
+ {
+ elog(DEBUG1, "brand-new NVWAL");
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else if (EndOfLog <= discardedUpTo)
+ {
+ elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = InvalidXLogRecPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+ /* The following "Tricky point" is to initialize the buffer */
+ }
+ else
+ {
+ int last_idx;
+ int idx;
+ XLogRecPtr ptr;
+
+ elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+ /*
+ * Initialize the xlblocks array because we decided to keep UNDONE
+ * records on the NVWAL buffer; otherwise, each buffer page with
+ * xlblocks == 0 (as initialized by XLOGShmemInit) would be
+ * accidentally cleared by the following AdvanceXLInsertBuffer!
+ *
+ * Two cases can be considered:
+ *
+ * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+ * Initialize up to (and including) the page containing the last
+ * record. That page ends with EndOfLog. The next page "N",
+ * beginning with EndOfLog, is left untouched because, in the
+ * corner case that all the NVWAL buffer pages are already
+ * filled, page N occupies the same location as the first page
+ * "F" beginning with discardedUpTo. Of course we must not
+ * overwrite page F.
+ *
+ * In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+ * last_idx, indicating page N. Then we go forward from page F
+ * up to (but excluding) page N.
+ *
+ * 2) EndOfLog is not on a page boundary: Initialize all the pages
+ * except the page "L" holding the last record. Page L is
+ * initialized by the following "Tricky point" block, including
+ * its content.
+ *
+ * In either case, XLogCtl->InitializedUpTo is initialized in the
+ * following "Tricky" if-else block.
+ */
+
+ last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+ ptr = discardedUpTo;
+ for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+ idx = NextBufIdx(idx))
+ {
+ ptr += XLOG_BLCKSZ;
+ XLogCtl->xlblocks[idx] = ptr;
+ }
+ }
+ }
+
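The ring-index walk in the initialization loop above can be sketched standalone. Here `BLCKSZ` and `NPAGES` are hypothetical small sizes, and `rec_ptr_to_buf_idx` mirrors what `XLogRecPtrToBufIdx` does (LSN divided by the page size, modulo the number of buffer pages); each visited slot is set to its page's end LSN:

```c
#include <assert.h>
#include <stdint.h>

#define BLCKSZ 8192u
#define NPAGES 4u				/* hypothetical number of NVWAL buffer pages */

static uint64_t xlblocks[NPAGES];

/* Sketch of XLogRecPtrToBufIdx: which ring slot holds this LSN's page. */
static int
rec_ptr_to_buf_idx(uint64_t ptr)
{
	return (int) ((ptr / BLCKSZ) % NPAGES);
}

/*
 * Walk ring-buffer page slots starting at the page of 'from', stopping
 * before the slot of 'end', recording each page's end LSN -- mirroring
 * the xlblocks initialization loop in the patch.
 */
static void
init_xlblocks(uint64_t from, uint64_t end)
{
	int			last_idx = rec_ptr_to_buf_idx(end);
	uint64_t	ptr = from;
	int			idx;

	for (idx = rec_ptr_to_buf_idx(ptr); idx != last_idx;
		 idx = (idx + 1) % NPAGES)
	{
		ptr += BLCKSZ;
		xlblocks[idx] = ptr;	/* each slot holds its page's end LSN */
	}
}
```

Stopping at `last_idx` (rather than at a byte count) is what protects the slot that would alias page "F" when the buffer is completely full.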
/*
- * Tricky point here: readBuf contains the *last* block that the LastRec
- * record spans, not the one it starts in. The last block is indeed the
- * one we want to use.
+ * Tricky point here: readBuf contains the *last* block that the
+ * LastRec record spans, not the one it starts in. The last block is
+ * indeed the one we want to use.
*/
if (EndOfLog % XLOG_BLCKSZ != 0)
{
@@ -7818,6 +8245,9 @@ StartupXLOG(void)
memcpy(page, xlogreader->readBuf, len);
memset(page + len, 0, XLOG_BLCKSZ - len);
+ if (NvwalAvail)
+ nv_persist(page, XLOG_BLCKSZ);
+
XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
}
@@ -7831,12 +8261,54 @@ StartupXLOG(void)
XLogCtl->InitializedUpTo = EndOfLog;
}
- LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+ if (NvwalAvail)
+ {
+ XLogRecPtr SegBeginPtr;
+
+ /*
+ * If the NVWAL buffer is used, writing records out to segment files
+ * must be done segment by segment, so Logwrt{Rqst,Result} (and also
+ * discardedUpTo) must be multiples of wal_segment_size. Back them
+ * off to the last segment boundary.
+ */
+
+ SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+ /*
+ * persistentUpTo need not be a multiple of wal_segment_size; it
+ * should be the drained-up-to LSN. walsenders use it to load
+ * records from the NVWAL buffer.
+ */
+ XLogCtl->persistentUpTo = EndOfLog;
+
+ /* Update discardedUpTo in pg_control if still invalid */
+ if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+ {
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+ }
+
+ elog(DEBUG1, "EndOfLog: %X/%X",
+ (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
- XLogCtl->LogwrtResult = LogwrtResult;
+ elog(DEBUG1, "SegBeginPtr: %X/%X",
+ (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+ }
+ else
+ {
+ LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
- XLogCtl->LogwrtRqst.Write = EndOfLog;
- XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ XLogCtl->LogwrtRqst.Write = EndOfLog;
+ XLogCtl->LogwrtRqst.Flush = EndOfLog;
+ }
/*
* Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7967,6 +8439,7 @@ StartupXLOG(void)
char origpath[MAXPGPATH];
char partialfname[MAXFNAMELEN];
char partialpath[MAXPGPATH];
+ XLogRecPtr discardedUpTo;
XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7978,6 +8451,53 @@ StartupXLOG(void)
*/
XLogArchiveCleanup(partialfname);
+ /*
+ * If the NVWAL is also used for archival recovery, write old
+ * records out to segment files so they can be archived. Note that
+ * we need the WAL-related locks because LocalXLogInsertAllowed has
+ * already been set to -1.
+ */
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo < EndOfLog)
+ {
+ XLogwrtRqst WriteRqst;
+ TimeLineID thisTLI = ThisTimeLineID;
+ XLogRecPtr SegBeginPtr =
+ EndOfLog - (EndOfLog % wal_segment_size);
+
+ /*
+ * XXX Assume that all the records have the same TLI.
+ */
+ ThisTimeLineID = EndOfLogTLI;
+
+ WriteRqst.Write = EndOfLog;
+ WriteRqst.Flush = 0;
+
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ XLogWrite(WriteRqst, false);
+
+ /*
+ * Force back-off to the last segment boundary.
+ */
+ LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+ ControlFile->discardedUpTo = SegBeginPtr;
+ UpdateControlFile();
+ LWLockRelease(ControlFileLock);
+
+ LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ XLogCtl->LogwrtResult = LogwrtResult;
+ XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+ XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ LWLockRelease(WALWriteLock);
+
+ ThisTimeLineID = thisTLI;
+ }
+
durable_rename(origpath, partialpath, ERROR);
XLogArchiveNotify(partialfname);
}
@@ -7987,7 +8507,10 @@ StartupXLOG(void)
/*
* Preallocate additional log files, if wanted.
*/
- PreallocXlogFiles(EndOfLog);
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(EndOfLog);
/*
* Okay, we're officially UP.
@@ -8551,10 +9074,24 @@ GetInsertRecPtr(void)
/*
* GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
* position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
*/
XLogRecPtr
GetFlushRecPtr(void)
{
+ if (NvwalAvail)
+ {
+ XLogRecPtr ret;
+
+ SpinLockAcquire(&XLogCtl->info_lck);
+ LogwrtResult = XLogCtl->LogwrtResult;
+ ret = XLogCtl->persistentUpTo;
+ SpinLockRelease(&XLogCtl->info_lck);
+
+ return ret;
+ }
+
SpinLockAcquire(&XLogCtl->info_lck);
LogwrtResult = XLogCtl->LogwrtResult;
SpinLockRelease(&XLogCtl->info_lck);
@@ -8854,6 +9391,9 @@ CreateCheckPoint(int flags)
VirtualTransactionId *vxids;
int nvxids;
+ /* for non-volatile WAL buffer */
+ XLogRecPtr newDiscardedUpTo = 0;
+
/*
* An end-of-recovery checkpoint is really a shutdown checkpoint, just
* issued at a different time.
@@ -9165,6 +9705,22 @@ CreateCheckPoint(int flags)
*/
PriorRedoPtr = ControlFile->checkPointCopy.redo;
+ /*
+ * If non-volatile WAL buffer is used, discardedUpTo should be updated
+ * and persisted in the control file, so the new value should be
+ * calculated here.
+ *
+ * TODO: Do not copy-and-paste this code...
+ */
+ if (NvwalAvail)
+ {
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ _logSegNo--;
+
+ newDiscardedUpTo = _logSegNo * wal_segment_size;
+ }
+
/*
* Update the control file.
*/
@@ -9173,6 +9729,16 @@ CreateCheckPoint(int flags)
ControlFile->state = DB_SHUTDOWNED;
ControlFile->checkPoint = ProcLastRecPtr;
ControlFile->checkPointCopy = checkPoint;
+ if (NvwalAvail)
+ {
+ /*
+ * A new value should not fall behind the old one.
+ */
+ if (ControlFile->discardedUpTo < newDiscardedUpTo)
+ ControlFile->discardedUpTo = newDiscardedUpTo;
+ else
+ newDiscardedUpTo = ControlFile->discardedUpTo;
+ }
ControlFile->time = (pg_time_t) time(NULL);
/* crash recovery should always recover to the end of WAL */
ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9190,6 +9756,44 @@ CreateCheckPoint(int flags)
UpdateControlFile();
LWLockRelease(ControlFileLock);
+ /*
+ * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+ * so that the XLOG records older than newDiscardedUpTo are treated as
+ * "already written and flushed."
+ */
+ if (NvwalAvail)
+ {
+ Assert(newDiscardedUpTo > 0);
+
+ /* Update process-local variables */
+ LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+ /*
+ * Update shared-memory variables. We need both light-weight lock and
+ * spin lock to update them.
+ */
+ LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+ SpinLockAcquire(&XLogCtl->info_lck);
+
+ /*
+ * Note that there is a corner case in which the process-local
+ * LogwrtResult falls behind the shared XLogCtl->LogwrtResult: the
+ * whole non-volatile XLOG buffer is filled and some pages are
+ * written out to segment files between UpdateControlFile and the
+ * LWLockAcquire above.
+ *
+ * TODO: For now, we ignore that case because it can hardly occur.
+ */
+ XLogCtl->LogwrtResult = LogwrtResult;
+
+ if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+ if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+ XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+ SpinLockRelease(&XLogCtl->info_lck);
+ LWLockRelease(WALWriteLock);
+ }
+
/* Update shared-memory copy of checkpoint XID/epoch */
SpinLockAcquire(&XLogCtl->info_lck);
XLogCtl->ckptFullXid = checkPoint.nextXid;
@@ -9213,22 +9817,48 @@ CreateCheckPoint(int flags)
if (PriorRedoPtr != InvalidXLogRecPtr)
UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
- /*
- * Delete old log files, those no longer needed for last checkpoint to
- * prevent the disk holding the xlog from growing full.
- */
- XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
- KeepLogSeg(recptr, &_logSegNo);
- InvalidateObsoleteReplicationSlots(_logSegNo);
- _logSegNo--;
- RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ if (NvwalAvail)
+ {
+ /*
+ * We already set _logSegNo to the value equivalent to discardedUpTo.
+ * We first increment it to call InvalidateObsoleteReplicationSlots.
+ */
+ _logSegNo++;
+ InvalidateObsoleteReplicationSlots(_logSegNo);
+
+ /*
+ * Then we decrement _logSegNo again to remove the WAL segment files
+ * that have spilled out of the non-volatile WAL buffer.
+ *
+ * Note that wal_recycle must be set to off for segment files to be
+ * removed.
+ */
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ }
+ else
+ {
+ /*
+ * Delete old log files, those no longer needed for last checkpoint to
+ * prevent the disk holding the xlog from growing full.
+ */
+ XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+ KeepLogSeg(recptr, &_logSegNo);
+ InvalidateObsoleteReplicationSlots(_logSegNo);
+ _logSegNo--;
+ RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+ }
/*
* Make more log segments if needed. (Do this after recycling old log
* segments, since that may supply some of the needed files.)
*/
if (!shutdown)
- PreallocXlogFiles(recptr);
+ {
+ if (NvwalAvail)
+ PreallocNonVolatileXlogBuffer();
+ else
+ PreallocXlogFiles(recptr);
+ }
/*
* Truncate pg_subtrans if possible. We can throw away all data before
@@ -11985,6 +12615,170 @@ CancelBackup(void)
}
}
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+ return NvwalAvail;
+}
+
+/*
+ * Returns the number of bytes we can load from the NVWAL and sets
+ * *nvwalptr to the LSN to load from.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+ XLogRecPtr readUpTo;
+ XLogRecPtr discardedUpTo;
+
+ Assert(IsNvwalAvail());
+ Assert(nvwalptr != NULL);
+
+ readUpTo = target + count;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check if all the records are on WAL segment files */
+ if (readUpTo <= discardedUpTo)
+ return 0;
+
+ /* Check if all the records are on NVWAL */
+ if (discardedUpTo <= target)
+ {
+ *nvwalptr = target;
+ return count;
+ }
+
+ /* Some on WAL segment files, some on NVWAL */
+ *nvwalptr = discardedUpTo;
+ return (Size) (readUpTo - discardedUpTo);
+}
+
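The three-way split implemented by `GetLoadableSizeFromNvwal` above can be restated as a small standalone function. This is an illustrative sketch, not the patch's code; locking is omitted and plain integers stand in for `XLogRecPtr`/`Size`:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decide how many of 'count' bytes starting at LSN 'target' can be read
 * from the NVWAL buffer, given that everything below 'discarded_up_to'
 * has already been evicted to segment files. On a nonzero return,
 * *nvwal_start is set to the first LSN to read from the NVWAL.
 */
static uint64_t
loadable_from_nvwal(uint64_t target, uint64_t count,
					uint64_t discarded_up_to, uint64_t *nvwal_start)
{
	uint64_t	read_up_to = target + count;

	/* Case 1: all the requested records are on WAL segment files. */
	if (read_up_to <= discarded_up_to)
		return 0;

	/* Case 2: all the requested records are on the NVWAL. */
	if (discarded_up_to <= target)
	{
		*nvwal_start = target;
		return count;
	}

	/* Case 3: head on segment files, tail on the NVWAL. */
	*nvwal_start = discarded_up_to;
	return read_up_to - discarded_up_to;
}
```

A caller (such as a walsender read path) would read the first `count - returned` bytes from segment files and the remainder from the buffer, starting at `*nvwal_start`.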
+/*
+ * Like WALRead() in xlogreader.c, but loads from the non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+ char *p;
+ XLogRecPtr recptr;
+ Size nbytes;
+
+ Assert(NvwalAvail);
+
+ p = buf;
+ recptr = startptr;
+ nbytes = count;
+
+ /*
+ * Hold WALBufMappingLock in shared mode to prevent others from
+ * rotating the WAL buffer while we copy WAL records from it. We do
+ * not need an exclusive lock because this function does not rotate
+ * the buffer.
+ */
+ LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+ while (nbytes > 0)
+ {
+ char *q;
+ Size off;
+ Size max_copy;
+ Size copybytes;
+ XLogRecPtr discardedUpTo;
+
+ LWLockAcquire(ControlFileLock, LW_SHARED);
+ discardedUpTo = ControlFile->discardedUpTo;
+ LWLockRelease(ControlFileLock);
+
+ /* Check if the records we need have been already evicted or not */
+ if (recptr < discardedUpTo)
+ {
+ LWLockRelease(WALBufMappingLock);
+
+ /* TODO error handling? */
+ return false;
+ }
+
+ /*
+ * Get the target address in the non-volatile WAL buffer and the
+ * size we can copy from it at once; because the buffer can wrap
+ * around, we might have to split the copy into two or more pieces.
+ */
+ off = recptr % NvwalSize;
+ q = XLogCtl->pages + off;
+ max_copy = NvwalSize - off;
+ copybytes = Min(nbytes, max_copy);
+
+ memcpy(p, q, copybytes);
+
+ /* Update state for copy */
+ recptr += copybytes;
+ nbytes -= copybytes;
+ p += copybytes;
+ }
+
+ LWLockRelease(WALBufMappingLock);
+ return true;
+}
+
+static bool
+IsXLogSourceFromStream(XLogSource source)
+{
+ switch (source)
+ {
+ case XLOG_FROM_STREAM:
+ case XLOG_FROM_STREAM_NVWAL:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+static bool
+IsXLogSourceFromNvwal(XLogSource source)
+{
+ switch (source)
+ {
+ case XLOG_FROM_NVWAL:
+ case XLOG_FROM_STREAM_NVWAL:
+ return true;
+
+ default:
+ return false;
+ }
+}
+
+static bool
+NeedsForMoreXLog(XLogRecPtr targetChunkEndPtr)
+{
+ switch (readSource)
+ {
+ case XLOG_FROM_ARCHIVE:
+ case XLOG_FROM_PG_WAL:
+ return (readFile < 0);
+
+ case XLOG_FROM_NVWAL:
+ Assert(NvwalAvail);
+ return false;
+
+ case XLOG_FROM_STREAM:
+ return (flushedUpto < targetChunkEndPtr);
+
+ case XLOG_FROM_STREAM_NVWAL:
+ Assert(NvwalAvail);
+ return (flushedUpto < targetChunkEndPtr);
+
+ default: /* XLOG_FROM_ANY */
+ Assert(readFile < 0);
+ return true;
+ }
+}
+
/*
* Read the XLOG page containing RecPtr into readBuf (if not read already).
* Returns number of bytes read, if the page is read successfully, or -1
@@ -12026,7 +12820,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
* See if we need to switch to a new segment because the requested record
* is not in the currently open one.
*/
- if (readFile >= 0 &&
+ if ((readFile >= 0 || IsXLogSourceFromNvwal(readSource)) &&
!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
{
/*
@@ -12043,7 +12837,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
}
}
- close(readFile);
+ if (readFile >= 0)
+ close(readFile);
readFile = -1;
readSource = XLOG_FROM_ANY;
}
@@ -12052,9 +12847,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
retry:
/* See if we need to retrieve more data */
- if (readFile < 0 ||
- (readSource == XLOG_FROM_STREAM &&
- flushedUpto < targetPagePtr + reqLen))
+ if (NeedsForMoreXLog(targetPagePtr + reqLen))
{
if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
private->randAccess,
@@ -12075,7 +12868,7 @@ retry:
* At this point, we have the right segment open and if we're streaming we
* know the requested record is in it.
*/
- Assert(readFile != -1);
+ Assert(readFile != -1 || IsXLogSourceFromNvwal(readSource));
/*
* If the current segment is being streamed from the primary, calculate how
@@ -12083,7 +12876,7 @@ retry:
* requested record has been received, but this is for the benefit of
* future calls, to allow quick exit at the top of this function.
*/
- if (readSource == XLOG_FROM_STREAM)
+ if (IsXLogSourceFromStream(readSource))
{
if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
readLen = XLOG_BLCKSZ;
@@ -12094,41 +12887,59 @@ retry:
else
readLen = XLOG_BLCKSZ;
- /* Read the requested page */
readOff = targetPageOff;
- pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
- r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
- if (r != XLOG_BLCKSZ)
+ if (IsXLogSourceFromNvwal(readSource))
{
- char fname[MAXFNAMELEN];
- int save_errno = errno;
+ Size offset = (Size) (targetPagePtr % NvwalSize);
+ char *readpos = XLogCtl->pages + offset;
+
+ Assert(offset % XLOG_BLCKSZ == 0);
+ /* Load the requested page from non-volatile WAL buffer */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ memcpy(readBuf, readpos, readLen);
pgstat_report_wait_end();
- XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
- if (r < 0)
+
+ /* There is no other clue about the TLI... */
+ xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+ }
+ else
+ {
+ /* Read the requested page from file */
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+ if (r != XLOG_BLCKSZ)
{
- errno = save_errno;
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode_for_file_access(),
- errmsg("could not read from log segment %s, offset %u: %m",
- fname, readOff)));
+ char fname[MAXFNAMELEN];
+ int save_errno = errno;
+
+ pgstat_report_wait_end();
+ XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+ if (r < 0)
+ {
+ errno = save_errno;
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode_for_file_access(),
+ errmsg("could not read from log segment %s, offset %u: %m",
+ fname, readOff)));
+ }
+ else
+ ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+ (errcode(ERRCODE_DATA_CORRUPTED),
+ errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+ fname, readOff, r, (Size) XLOG_BLCKSZ)));
+ goto next_record_is_invalid;
}
- else
- ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
- (errcode(ERRCODE_DATA_CORRUPTED),
- errmsg("could not read from log segment %s, offset %u: read %d of %zu",
- fname, readOff, r, (Size) XLOG_BLCKSZ)));
- goto next_record_is_invalid;
+ pgstat_report_wait_end();
+
+ xlogreader->seg.ws_tli = curFileTLI;
}
- pgstat_report_wait_end();
Assert(targetSegNo == readSegNo);
Assert(targetPageOff == readOff);
Assert(reqLen <= readLen);
- xlogreader->seg.ws_tli = curFileTLI;
-
/*
* Check the page header immediately, so that we can retry immediately if
* it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -12162,6 +12973,17 @@ retry:
goto next_record_is_invalid;
}
+ /*
+ * Update curFileTLI on each verified page if the non-volatile WAL buffer
+ * is in use, because there is no TimeLineID information in the NVWAL's filename.
+ */
+ if (IsXLogSourceFromNvwal(readSource) &&
+ curFileTLI != xlogreader->latestPageTLI)
+ {
+ curFileTLI = xlogreader->latestPageTLI;
+ elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+ }
+
return readLen;
next_record_is_invalid:
@@ -12243,7 +13065,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (!InArchiveRecovery)
currentSource = XLOG_FROM_PG_WAL;
else if (currentSource == XLOG_FROM_ANY ||
- (!StandbyMode && currentSource == XLOG_FROM_STREAM))
+ (!StandbyMode && IsXLogSourceFromStream(currentSource)))
{
lastSourceFailed = false;
currentSource = XLOG_FROM_ARCHIVE;
@@ -12266,6 +13088,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
+ case XLOG_FROM_NVWAL:
/*
* Check to see if the trigger file exists. Note that we
@@ -12279,6 +13102,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
return false;
}
+ /* Try NVWAL if available */
+ if (NvwalAvail && currentSource != XLOG_FROM_NVWAL)
+ {
+ currentSource = XLOG_FROM_NVWAL;
+ break;
+ }
+
/*
* Not in standby mode, and we've now tried the archive
* and pg_wal.
@@ -12290,11 +13120,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* Move to XLOG_FROM_STREAM state, and set to start a
* walreceiver if necessary.
*/
- currentSource = XLOG_FROM_STREAM;
+ if (currentSource == XLOG_FROM_NVWAL)
+ currentSource = XLOG_FROM_STREAM_NVWAL;
+ else
+ currentSource = XLOG_FROM_STREAM;
startWalReceiver = true;
break;
case XLOG_FROM_STREAM:
+ case XLOG_FROM_STREAM_NVWAL:
/*
* Failure while streaming. Most likely, we got here
@@ -12400,6 +13234,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
{
case XLOG_FROM_ARCHIVE:
case XLOG_FROM_PG_WAL:
+ case XLOG_FROM_NVWAL:
/*
* WAL receiver must not be running when reading WAL from
@@ -12417,6 +13252,59 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
if (randAccess)
curFileTLI = 0;
+ /* Try to load from NVWAL */
+ if (currentSource == XLOG_FROM_NVWAL)
+ {
+ XLogRecPtr discardedUpTo;
+
+ Assert(NvwalAvail);
+
+ /*
+ * Check if the target page exists on NVWAL. Note that
+ * RecPtr points to the end of the target chunk.
+ *
+ * TODO need ControlFileLock?
+ */
+ discardedUpTo = ControlFile->discardedUpTo;
+ if (discardedUpTo != InvalidXLogRecPtr &&
+ discardedUpTo < RecPtr &&
+ RecPtr <= discardedUpTo + NvwalSize)
+ {
+ /* Report recovery progress in PS display */
+ set_ps_display("recovering NVWAL");
+ elog(DEBUG1, "recovering NVWAL");
+
+ /* Track source of data and receipt time */
+ readSource = XLOG_FROM_NVWAL;
+ XLogReceiptSource = XLOG_FROM_NVWAL;
+ XLogReceiptTime = GetCurrentTimestamp();
+
+ /*
+ * Construct expectedTLEs. This is necessary when
+ * recovering only from the NVWAL because its filename
+ * does not contain any TLI information.
+ */
+ if (!expectedTLEs)
+ {
+ TimeLineHistoryEntry *entry;
+
+ entry = palloc(sizeof(TimeLineHistoryEntry));
+ entry->tli = recoveryTargetTLI;
+ entry->begin = entry->end = InvalidXLogRecPtr;
+
+ expectedTLEs = list_make1(entry);
+ elog(DEBUG1, "expectedTLEs: [%u]",
+ (uint32) recoveryTargetTLI);
+ }
+
+ return true;
+ }
+
+ /* Target page does not exist on NVWAL */
+ lastSourceFailed = true;
+ break;
+ }
+
/*
* Try to restore the file from archive, or read an existing
* file from pg_wal.
@@ -12434,6 +13322,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
break;
case XLOG_FROM_STREAM:
+ case XLOG_FROM_STREAM_NVWAL:
{
bool havedata;
@@ -12558,21 +13447,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
* info is set correctly and XLogReceiptTime isn't
* changed.
*/
- if (readFile < 0)
+ if (currentSource == XLOG_FROM_STREAM_NVWAL)
{
if (!expectedTLEs)
expectedTLEs = readTimeLineHistory(receiveTLI);
- readFile = XLogFileRead(readSegNo, PANIC,
- receiveTLI,
- XLOG_FROM_STREAM, false);
- Assert(readFile >= 0);
+
+ /* TODO is it ok to return, not to break switch? */
+ readSource = XLOG_FROM_STREAM_NVWAL;
+ XLogReceiptSource = XLOG_FROM_STREAM_NVWAL;
+ return true;
}
else
{
- /* just make sure source info is correct... */
- readSource = XLOG_FROM_STREAM;
- XLogReceiptSource = XLOG_FROM_STREAM;
- return true;
+ if (readFile < 0)
+ {
+ if (!expectedTLEs)
+ expectedTLEs = readTimeLineHistory(receiveTLI);
+ readFile = XLogFileRead(readSegNo, PANIC,
+ receiveTLI,
+ XLOG_FROM_STREAM, false);
+ Assert(readFile >= 0);
+ }
+ else
+ {
+ /* just make sure source info is correct... */
+ readSource = XLOG_FROM_STREAM;
+ XLogReceiptSource = XLOG_FROM_STREAM;
+ return true;
+ }
}
break;
}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..d3841cc559 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1067,11 +1067,24 @@ WALRead(XLogReaderState *state,
char *p;
XLogRecPtr recptr;
Size nbytes;
+#ifndef FRONTEND
+ XLogRecPtr recptr_nvwal = 0;
+ Size nbytes_nvwal = 0;
+#endif
p = buf;
recptr = startptr;
nbytes = count;
+#ifndef FRONTEND
+ /* Try to load records directly from NVWAL if used */
+ if (IsNvwalAvail())
+ {
+ nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+ nbytes = count - nbytes_nvwal;
+ }
+#endif
+
while (nbytes > 0)
{
uint32 startoff;
@@ -1139,6 +1152,17 @@ WALRead(XLogReaderState *state,
p += readbytes;
}
+#ifndef FRONTEND
+ if (IsNvwalAvail())
+ {
+ if (!CopyXLogRecordsFromNVWAL(p, nbytes_nvwal, recptr_nvwal))
+ {
+ /* TODO graceful error handling */
+ elog(PANIC, "some records on NVWAL had been discarded");
+ }
+ }
+#endif
+
return true;
}
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f70..eabcaae2ff 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
ControlFile->checkPointCopy.oldestCommitTsXid);
printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
ControlFile->checkPointCopy.newestCommitTsXid);
+ printf(_("NVWAL discarded up to: %X/%X\n"),
+ (uint32) (ControlFile->discardedUpTo >> 32),
+ (uint32) ControlFile->discardedUpTo);
printf(_("Time of latest checkpoint: %s\n"),
ckpttime_str);
printf(_("Fake LSN counter for unlogged rels: %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 03fd1267e8..ddf786290b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -354,6 +354,14 @@ extern void XLogRequestWalReceiverReply(void);
extern void assign_max_wal_size(int newval, void *extra);
extern void assign_checkpoint_completion_target(double newval, void *extra);
+extern bool IsNvwalAvail(void);
+extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
+ Size count,
+ XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+ Size count,
+ XLogRecPtr startptr);
+
/*
* Routines to start, stop, and get status of a base backup.
*/
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e..012eeee058 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
/* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION 1300
+#define PG_CONTROL_VERSION 1301
/* Nonce key length, see below */
#define MOCK_AUTH_NONCE_LEN 32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
XLogRecPtr unloggedLSN; /* current fake LSN value, for unlogged rels */
+ /*
+ * Used for non-volatile WAL buffer (NVWAL).
+ *
+ * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+ * checkpoint or a restartpoint is completed successfully, or when the
+ * whole NVWAL is filled with WAL records and a new record is being
+ * inserted. This field indicates that the NVWAL contains WAL records in
+ * the range [discardedUpTo, discardedUpTo+S), where S is the size of
+ * the NVWAL. Note that WAL records whose LSNs are less than
+ * discardedUpTo may still remain in WAL segment files and be needed
+ * for recovery.
+ *
+ * It is set to zero when NVWAL is not used.
+ */
+ XLogRecPtr discardedUpTo;
+
/*
* These two values determine the minimum point we must recover up to
* before starting up:
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..bbd47e1663 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -142,14 +142,17 @@ HINT: No function matches the given name and argument types. You might need to
select setting as segsize
from pg_settings where name = 'wal_segment_size'
\gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
ok
----
t
(1 row)
-- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
ok
----
t
@@ -161,14 +164,15 @@ select * from pg_ls_waldir() limit 0;
------+------+--------------
(0 rows)
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
ok
----
t
(1 row)
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+ (select * from pg_ls_waldir() w
+ where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
ok
----
t
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 094e8f8296..09c326775d 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -39,15 +39,19 @@ SELECT num_nulls();
select setting as segsize
from pg_settings where name = 'wal_segment_size'
\gset
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
-- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
-- Test not-run-to-completion cases.
select * from pg_ls_waldir() limit 0;
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+ (select * from pg_ls_waldir() w
+ where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
select count(*) >= 0 as ok from pg_ls_archive_statusdir();
--
2.17.1
[Attachment: v4-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (application/octet-stream)]
From 72506483b9b02a7f89273a5090ec6ab061457831 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:58 +0900
Subject: [PATCH v4 3/5] walreceiver supports non-volatile WAL buffer
Walreceiver now stores received records directly into the non-volatile
WAL buffer if applicable.
---
src/backend/access/transam/xlog.c | 31 +++++++++++++++-
src/backend/replication/walreceiver.c | 53 ++++++++++++++++++++++++++-
src/include/access/xlog.h | 4 ++
3 files changed, 85 insertions(+), 3 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a579a308f..dfa7c2517b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -925,6 +925,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
XLogSource source, bool notfoundOk);
static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+static bool CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr,
+ bool store);
static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -12664,6 +12666,21 @@ GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
*/
bool
CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+ return CopyXLogRecordsOnNVWAL(buf, count, startptr, false);
+}
+
+/*
+ * Called by walreceiver.
+ */
+bool
+CopyXLogRecordsToNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+ return CopyXLogRecordsOnNVWAL(buf, count, startptr, true);
+}
+
+static bool
+CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr, bool store)
{
char *p;
XLogRecPtr recptr;
@@ -12713,7 +12730,13 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
max_copy = NvwalSize - off;
copybytes = Min(nbytes, max_copy);
- memcpy(p, q, copybytes);
+ if (store)
+ {
+ memcpy(q, p, copybytes);
+ nv_flush(q, copybytes);
+ }
+ else
+ memcpy(p, q, copybytes);
/* Update state for copy */
recptr += copybytes;
@@ -12725,6 +12748,12 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
return true;
}
+void
+SyncNVWAL(void)
+{
+ nv_drain();
+}
+
static bool
IsXLogSourceFromStream(XLogSource source)
{
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7c11e1ab44..563dd59ec0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -130,6 +130,7 @@ static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *start
static void WalRcvDie(int code, Datum arg);
static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+static void XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr);
static void XLogWalRcvFlush(bool dying);
static void XLogWalRcvSendReply(bool force, bool requestReply);
static void XLogWalRcvSendHSFeedback(bool immed);
@@ -856,7 +857,10 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
buf += hdrlen;
len -= hdrlen;
- XLogWalRcvWrite(buf, len, dataStart);
+ if (IsNvwalAvail())
+ XLogWalRcvStore(buf, len, dataStart);
+ else
+ XLogWalRcvWrite(buf, len, dataStart);
break;
}
case 'k': /* Keepalive */
@@ -991,6 +995,42 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
}
+/*
+ * Like XLogWalRcvWrite, but store to non-volatile WAL buffer.
+ */
+static void
+XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr)
+{
+ Assert(IsNvwalAvail());
+
+ CopyXLogRecordsToNVWAL(buf, nbytes, recptr);
+
+ /*
+ * Also write out to file if we have to archive segments.
+ *
+ * We could do this segment by segment, but we reuse the existing
+ * method and do it record by record, because the former would add
+ * more complexity (locking WALBufMappingLock, getting the address
+ * of the segment on the non-volatile WAL buffer, etc.).
+ */
+ if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+ XLogWalRcvWrite(buf, nbytes, recptr);
+ else
+ {
+ /*
+ * Update status just as XLogWalRcvWrite does.
+ */
+
+ /* Update process-local status */
+ XLByteToSeg(recptr + nbytes, recvSegNo, wal_segment_size);
+ recvFileTLI = ThisTimeLineID;
+ LogstreamResult.Write = recptr + nbytes;
+
+ /* Update shared-memory status */
+ pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+ }
+}
+
/*
* Flush the log to disk.
*
@@ -1004,7 +1044,16 @@ XLogWalRcvFlush(bool dying)
{
WalRcvData *walrcv = WalRcv;
- issue_xlog_fsync(recvFile, recvSegNo);
+ /*
+ * We should call both SyncNVWAL and issue_xlog_fsync if we use both
+ * the NVWAL and WAL archiving, so we have the following two separate
+ * if-statements rather than one if-else-statement.
+ */
+ if (IsNvwalAvail())
+ SyncNVWAL();
+
+ if (recvFile >= 0)
+ issue_xlog_fsync(recvFile, recvSegNo);
LogstreamResult.Flush = LogstreamResult.Write;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ddf786290b..799357cfac 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -361,6 +361,10 @@ extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
extern bool CopyXLogRecordsFromNVWAL(char *buf,
Size count,
XLogRecPtr startptr);
+extern bool CopyXLogRecordsToNVWAL(char *buf,
+ Size count,
+ XLogRecPtr startptr);
+extern void SyncNVWAL(void);
/*
* Routines to start, stop, and get status of a base backup.
--
2.17.1
[Attachment: v4-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (application/octet-stream)]
From 5b794eab4f57a17c41b79769ee6def3cc050bdd0 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:59 +0900
Subject: [PATCH v4 4/5] pg_basebackup supports non-volatile WAL buffer
pg_basebackup now copies received WAL segments onto the non-volatile
WAL buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option.
The path will be written to postgresql.auto.conf or recovery.conf.
The size of the new NVWAL is the same as the master's.
---
src/bin/pg_basebackup/pg_basebackup.c | 335 +++++++++++++++++++++++++-
src/bin/pg_basebackup/streamutil.c | 69 ++++++
src/bin/pg_basebackup/streamutil.h | 3 +
src/bin/pg_rewind/pg_rewind.c | 4 +-
src/fe_utils/recovery_gen.c | 9 +-
src/include/fe_utils/recovery_gen.h | 3 +-
6 files changed, 407 insertions(+), 16 deletions(-)
diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 7a5d4562f9..9b85949078 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,9 @@
#ifdef HAVE_LIBZ
#include <zlib.h>
#endif
+#ifdef USE_NVWAL
+#include <libpmem.h>
+#endif
#include "access/xlog_internal.h"
#include "common/file_perm.h"
@@ -127,7 +130,8 @@ typedef enum
static char *basedir = NULL;
static TablespaceList tablespace_dirs = {NULL, NULL};
static char *xlog_dir = NULL;
-static char format = 'p'; /* p(lain)/t(ar) */
+static char format = 'p'; /* p(lain)/t(ar); 'p' even if 'nvwal' given */
+static bool format_nvwal = false; /* true if 'nvwal' given */
static char *label = "pg_basebackup base backup";
static bool noclean = false;
static bool checksum_failure = false;
@@ -150,14 +154,24 @@ static bool verify_checksums = true;
static bool manifest = true;
static bool manifest_force_encode = false;
static char *manifest_checksums = NULL;
+static char *nvwal_path = NULL;
+#ifdef USE_NVWAL
+static size_t nvwal_size = 0;
+static char *nvwal_pages = NULL;
+static size_t nvwal_mapped_len = 0;
+#endif
static bool success = false;
+static bool xlogdir_is_pg_xlog = false;
static bool made_new_pgdata = false;
static bool found_existing_pgdata = false;
static bool made_new_xlogdir = false;
static bool found_existing_xlogdir = false;
static bool made_tablespace_dirs = false;
static bool found_tablespace_dirs = false;
+#ifdef USE_NVWAL
+static bool made_new_nvwal = false;
+#endif
/* Progress counters */
static uint64 totalsize_kb;
@@ -382,7 +396,7 @@ usage(void)
printf(_(" %s [OPTION]...\n"), progname);
printf(_("\nOptions controlling the output:\n"));
printf(_(" -D, --pgdata=DIRECTORY receive base backup into directory\n"));
- printf(_(" -F, --format=p|t output format (plain (default), tar)\n"));
+ printf(_(" -F, --format=p|t|n output format (plain (default), tar, nvwal)\n"));
printf(_(" -r, --max-rate=RATE maximum transfer rate to transfer data directory\n"
" (in kB/s, or use suffix \"k\" or \"M\")\n"));
printf(_(" -R, --write-recovery-conf\n"
@@ -390,6 +404,7 @@ usage(void)
printf(_(" -T, --tablespace-mapping=OLDDIR=NEWDIR\n"
" relocate tablespace in OLDDIR to NEWDIR\n"));
printf(_(" --waldir=WALDIR location for the write-ahead log directory\n"));
+ printf(_(" --nvwal-path=NVWAL location for the NVWAL file\n"));
printf(_(" -X, --wal-method=none|fetch|stream\n"
" include required WAL files with specified method\n"));
printf(_(" -z, --gzip compress tar output\n"));
@@ -630,9 +645,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
/* In post-10 cluster, pg_xlog has been renamed to pg_wal */
snprintf(param->xlog, sizeof(param->xlog), "%s/%s",
- basedir,
- PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
- "pg_xlog" : "pg_wal");
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
/* Temporary replication slots are only supported in 10 and newer */
if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_TEMP_SLOTS)
@@ -669,9 +682,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
* tar file may arrive later.
*/
snprintf(statusdir, sizeof(statusdir), "%s/%s/archive_status",
- basedir,
- PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
- "pg_xlog" : "pg_wal");
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
{
@@ -1793,6 +1804,135 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
appendPQExpBuffer(buf, copybuf, r);
}
+#ifdef USE_NVWAL
+static void
+cleanup_nvwal_atexit(void)
+{
+ if (success || in_log_streamer)
+ return;
+
+ if (nvwal_pages != NULL)
+ {
+ pg_log_info("unmapping NVWAL");
+ if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+ {
+ pg_log_error("could not unmap NVWAL: %m");
+ return;
+ }
+ }
+
+ if (nvwal_path != NULL && made_new_nvwal)
+ {
+ pg_log_info("removing NVWAL file \"%s\"", nvwal_path);
+ if (unlink(nvwal_path) < 0)
+ {
+ pg_log_error("could not remove NVWAL file \"%s\": %m", nvwal_path);
+ return;
+ }
+ }
+}
+
+static int
+filter_walseg(const struct dirent *d)
+{
+ char fullpath[MAXPGPATH];
+ struct stat statbuf;
+
+ if (!IsXLogFileName(d->d_name))
+ return 0;
+
+ snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", d->d_name);
+
+ if (stat(fullpath, &statbuf) < 0)
+ return 0;
+
+ if (!S_ISREG(statbuf.st_mode))
+ return 0;
+
+ if (statbuf.st_size != WalSegSz)
+ return 0;
+
+ return 1;
+}
+
+static int
+compare_walseg(const struct dirent **a, const struct dirent **b)
+{
+ return strcmp((*a)->d_name, (*b)->d_name);
+}
+
+static void
+free_namelist(struct dirent **namelist, int nr)
+{
+ for (int i = 0; i < nr; ++i)
+ free(namelist[i]);
+
+ free(namelist);
+}
+
+static bool
+copy_walseg_onto_nvwal(const char *segname)
+{
+ char fullpath[MAXPGPATH];
+ int fd;
+ size_t off;
+ struct stat statbuf;
+ TimeLineID tli;
+ XLogSegNo segno;
+
+ snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", segname);
+
+ fd = open(fullpath, O_RDONLY);
+ if (fd < 0)
+ {
+ pg_log_error("could not open xlog segment \"%s\": %m", fullpath);
+ return false;
+ }
+
+ if (fstat(fd, &statbuf) < 0)
+ {
+ pg_log_error("could not fstat xlog segment \"%s\": %m", fullpath);
+ goto close_on_error;
+ }
+
+ if (!S_ISREG(statbuf.st_mode))
+ {
+ pg_log_error("xlog segment \"%s\" is not a regular file", fullpath);
+ goto close_on_error;
+ }
+
+ if (statbuf.st_size != WalSegSz)
+ {
+ pg_log_error("invalid size of xlog segment \"%s\"; expected %d, actual %zd",
+ fullpath, WalSegSz, (ssize_t) statbuf.st_size);
+ goto close_on_error;
+ }
+
+ XLogFromFileName(segname, &tli, &segno, WalSegSz);
+ off = ((size_t) segno * WalSegSz) % nvwal_size;
+
+ if (read(fd, &nvwal_pages[off], WalSegSz) < WalSegSz)
+ {
+ pg_log_error("could not fully read xlog segment \"%s\": %m", fullpath);
+ goto close_on_error;
+ }
+
+ if (close(fd) < 0)
+ {
+ pg_log_error("could not close xlog segment \"%s\": %m", fullpath);
+ return false;
+ }
+
+ return true;
+
+close_on_error:
+ (void) close(fd);
+ return false;
+}
+#endif
+
static void
BaseBackup(void)
{
@@ -1851,7 +1991,8 @@ BaseBackup(void)
* Build contents of configuration file if requested
*/
if (writerecoveryconf)
- recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot);
+ recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot,
+ nvwal_path);
/*
* Run IDENTIFY_SYSTEM so we can get the timeline
@@ -2216,6 +2357,69 @@ BaseBackup(void)
exit(1);
}
+#ifdef USE_NVWAL
+ /* Copy xlog segments into NVWAL when nvwal mode */
+ if (format_nvwal)
+ {
+ char xldr_path[MAXPGPATH];
+ int nr_segs;
+ struct dirent **namelist;
+
+ /* clear NVWAL before copying xlog segments */
+ pmem_memset_persist(nvwal_pages, 0, nvwal_size);
+
+ snprintf(xldr_path, sizeof(xldr_path), "%s/%s",
+ basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
+
+ /*
+ * Sort xlog segments in ascending order, filtering out non-segment
+ * files and directories.
+ */
+ nr_segs = scandir(xldr_path, &namelist, filter_walseg, compare_walseg);
+ if (nr_segs < 0)
+ {
+ pg_log_error("could not scan xlog directory \"%s\": %m", xldr_path);
+ exit(1);
+ }
+
+ /* Copy xlog segments onto NVWAL */
+ for (int i = 0; i < nr_segs; ++i)
+ {
+ if (!copy_walseg_onto_nvwal(namelist[i]->d_name))
+ {
+ free_namelist(namelist, nr_segs);
+ exit(1);
+ }
+ }
+
+ /* Copy complete; now remove all the xlog segments */
+ for (int i = 0; i < nr_segs; ++i)
+ {
+ char fullpath[MAXPGPATH];
+
+ snprintf(fullpath, sizeof(fullpath), "%s/%s",
+ xldr_path, namelist[i]->d_name);
+
+ if (unlink(fullpath) < 0)
+ {
+ pg_log_error("could not remove xlog segment \"%s\": %m", fullpath);
+ free_namelist(namelist, nr_segs);
+ exit(1);
+ }
+ }
+
+ free_namelist(namelist, nr_segs);
+
+ if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+ {
+ pg_log_error("could not unmap NVWAL: %m");
+ exit(1);
+ }
+ nvwal_pages = NULL;
+ nvwal_mapped_len = 0;
+ }
+#endif
+
if (verbose)
pg_log_info("base backup completed");
}
@@ -2257,6 +2461,7 @@ main(int argc, char **argv)
{"no-manifest", no_argument, NULL, 5},
{"manifest-force-encode", no_argument, NULL, 6},
{"manifest-checksums", required_argument, NULL, 7},
+ {"nvwal-path", required_argument, NULL, 8},
{NULL, 0, NULL, 0}
};
int c;
@@ -2297,9 +2502,27 @@ main(int argc, char **argv)
break;
case 'F':
if (strcmp(optarg, "p") == 0 || strcmp(optarg, "plain") == 0)
+ {
+ /* See the comment for "nvwal" below */
format = 'p';
+ format_nvwal = false;
+ }
else if (strcmp(optarg, "t") == 0 || strcmp(optarg, "tar") == 0)
+ {
+ /* See the comment for "nvwal" below */
format = 't';
+ format_nvwal = false;
+ }
+ else if (strcmp(optarg, "n") == 0 || strcmp(optarg, "nvwal") == 0)
+ {
+ /*
+ * If "nvwal" mode is given, we set the two variables as
+ * follows because it is almost the same as "plain" mode,
+ * except for NVWAL handling.
+ */
+ format = 'p';
+ format_nvwal = true;
+ }
else
{
pg_log_error("invalid output format \"%s\", must be \"plain\" or \"tar\"",
@@ -2354,6 +2577,9 @@ main(int argc, char **argv)
case 1:
xlog_dir = pg_strdup(optarg);
break;
+ case 8:
+ nvwal_path = pg_strdup(optarg);
+ break;
case 'l':
label = pg_strdup(optarg);
break;
@@ -2535,7 +2761,7 @@ main(int argc, char **argv)
{
if (format != 'p')
{
- pg_log_error("WAL directory location can only be specified in plain mode");
+ pg_log_error("WAL directory location can only be specified in plain or nvwal mode");
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
progname);
exit(1);
@@ -2552,6 +2778,44 @@ main(int argc, char **argv)
}
}
+#ifdef USE_NVWAL
+ if (format_nvwal)
+ {
+ if (nvwal_path == NULL)
+ {
+ pg_log_error("NVWAL file location must be given in nvwal mode");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+ /* clean up NVWAL file name and check if it is absolute */
+ canonicalize_path(nvwal_path);
+ if (!is_absolute_path(nvwal_path))
+ {
+ pg_log_error("NVWAL file location must be an absolute path");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+
+ /* We do not map NVWAL file here because we do not know its size yet */
+ }
+ else if (nvwal_path != NULL)
+ {
+ pg_log_error("NVWAL file location can only be specified in plain or nvwal mode");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+ progname);
+ exit(1);
+ }
+#else
+ if (format_nvwal || nvwal_path != NULL)
+ {
+ pg_log_error("this build does not support nvwal mode");
+ exit(1);
+ }
+#endif /* USE_NVWAL */
+
#ifndef HAVE_LIBZ
if (compresslevel != 0)
{
@@ -2596,6 +2860,9 @@ main(int argc, char **argv)
}
atexit(disconnect_atexit);
+ /* Remember the predicate for use after disconnection */
+ xlogdir_is_pg_xlog = (PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL);
+
/*
* Set umask so that directories/files are created with the same
* permissions as directories/files in the source data directory.
@@ -2622,6 +2889,16 @@ main(int argc, char **argv)
if (!RetrieveWalSegSize(conn))
exit(1);
+#ifdef USE_NVWAL
+ /* determine remote server's NVWAL size */
+ if (format_nvwal)
+ {
+ nvwal_size = RetrieveNvwalSize(conn);
+ if (nvwal_size == 0)
+ exit(1);
+ }
+#endif
+
/* Create pg_wal symlink, if required */
if (xlog_dir)
{
@@ -2634,8 +2911,7 @@ main(int argc, char **argv)
* renamed to pg_wal in post-10 clusters.
*/
linkloc = psprintf("%s/%s", basedir,
- PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
- "pg_xlog" : "pg_wal");
+ xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
#ifdef HAVE_SYMLINK
if (symlink(xlog_dir, linkloc) != 0)
@@ -2650,6 +2926,41 @@ main(int argc, char **argv)
free(linkloc);
}
+#ifdef USE_NVWAL
+ /* Create and map NVWAL file if required */
+ if (format_nvwal)
+ {
+ int is_pmem = 0;
+
+ nvwal_pages = pmem_map_file(nvwal_path, nvwal_size,
+ PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+ pg_file_create_mode,
+ &nvwal_mapped_len, &is_pmem);
+ if (nvwal_pages == NULL)
+ {
+ pg_log_error("could not map a new NVWAL file \"%s\": %m",
+ nvwal_path);
+ exit(1);
+ }
+
+ made_new_nvwal = true;
+ atexit(cleanup_nvwal_atexit);
+
+ if (!is_pmem)
+ {
+ pg_log_error("NVWAL file \"%s\" is not on PMEM", nvwal_path);
+ exit(1);
+ }
+
+ if (nvwal_size != nvwal_mapped_len)
+ {
+ pg_log_error("invalid size of NVWAL file \"%s\"; expected %zu, actual %zu",
+ nvwal_path, nvwal_size, nvwal_mapped_len);
+ exit(1);
+ }
+ }
+#endif
+
BaseBackup();
success = true;
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index be653ebb2d..baf3a7bc53 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -398,6 +398,75 @@ RetrieveDataDirCreatePerm(PGconn *conn)
return true;
}
+#ifdef USE_NVWAL
+/*
+ * Returns nvwal_size in bytes if available, 0 otherwise.
+ * Note that it is the caller's responsibility to check that the returned
+ * nvwal_size is really valid, that is, a multiple of the WAL segment size.
+ */
+size_t
+RetrieveNvwalSize(PGconn *conn)
+{
+ PGresult *res;
+ char unit[3];
+ int val;
+ size_t nvwal_size;
+
+ /* check connection existence */
+ Assert(conn != NULL);
+
+ /* fail if we do not have SHOW command */
+ if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_SHOW_CMD)
+ {
+ pg_log_error("SHOW command is not supported for retrieving nvwal_size");
+ return 0;
+ }
+
+ res = PQexec(conn, "SHOW nvwal_size");
+ if (PQresultStatus(res) != PGRES_TUPLES_OK)
+ {
+ pg_log_error("could not send replication command \"%s\": %s",
+ "SHOW nvwal_size", PQerrorMessage(conn));
+
+ PQclear(res);
+ return 0;
+ }
+ if (PQntuples(res) != 1 || PQnfields(res) < 1)
+ {
+ pg_log_error("could not fetch NVWAL size: got %d rows and %d fields, expected %d rows and %d or more fields",
+ PQntuples(res), PQnfields(res), 1, 1);
+
+ PQclear(res);
+ return 0;
+ }
+
+ /* fetch value and unit from the result */
+ if (sscanf(PQgetvalue(res, 0, 0), "%d%2s", &val, unit) != 2)
+ {
+ pg_log_error("NVWAL size could not be parsed");
+ PQclear(res);
+ return 0;
+ }
+
+ PQclear(res);
+
+ /* convert to bytes */
+ if (strcmp(unit, "MB") == 0)
+ nvwal_size = ((size_t) val) << 20;
+ else if (strcmp(unit, "GB") == 0)
+ nvwal_size = ((size_t) val) << 30;
+ else if (strcmp(unit, "TB") == 0)
+ nvwal_size = ((size_t) val) << 40;
+ else
+ {
+ pg_log_error("unsupported NVWAL unit");
+ return 0;
+ }
+
+ return nvwal_size;
+}
+#endif
+
/*
* Run IDENTIFY_SYSTEM through a given connection and give back to caller
* some result information if requested:
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 57448656e3..b4c2ab1a74 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -41,6 +41,9 @@ extern bool RunIdentifySystem(PGconn *conn, char **sysid,
XLogRecPtr *startpos,
char **db_name);
extern bool RetrieveWalSegSize(PGconn *conn);
+#ifdef USE_NVWAL
+extern size_t RetrieveNvwalSize(PGconn *conn);
+#endif
extern TimestampTz feGetCurrentTimestamp(void);
extern void feTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
long *secs, int *microsecs);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e44..858a399f52 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -360,7 +360,7 @@ main(int argc, char **argv)
pg_log_info("no rewind required");
if (writerecoveryconf && !dry_run)
WriteRecoveryConfig(conn, datadir_target,
- GenerateRecoveryConfig(conn, NULL));
+ GenerateRecoveryConfig(conn, NULL, NULL));
exit(0);
}
@@ -459,7 +459,7 @@ main(int argc, char **argv)
if (writerecoveryconf && !dry_run)
WriteRecoveryConfig(conn, datadir_target,
- GenerateRecoveryConfig(conn, NULL));
+ GenerateRecoveryConfig(conn, NULL, NULL));
pg_log_info("Done!");
diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c
index 46ca20e20b..1e08ec3fa8 100644
--- a/src/fe_utils/recovery_gen.c
+++ b/src/fe_utils/recovery_gen.c
@@ -20,7 +20,7 @@ static char *escape_quotes(const char *src);
* return it.
*/
PQExpBuffer
-GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
+GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot, char *nvwal_path)
{
PQconninfoOption *connOptions;
PQExpBufferData conninfo_buf;
@@ -95,6 +95,13 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
replication_slot);
}
+ if (nvwal_path)
+ {
+ escaped = escape_quotes(nvwal_path);
+ appendPQExpBuffer(contents, "nvwal_path = '%s'\n", escaped);
+ free(escaped);
+ }
+
if (PQExpBufferBroken(contents))
{
pg_log_error("out of memory");
diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h
index c8655cd294..061c59125b 100644
--- a/src/include/fe_utils/recovery_gen.h
+++ b/src/include/fe_utils/recovery_gen.h
@@ -21,7 +21,8 @@
#define MINIMUM_VERSION_FOR_RECOVERY_GUC 120000
extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn,
- char *pg_replication_slot);
+ char *pg_replication_slot,
+ char *nvwal_path);
extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
PQExpBuffer contents);
--
2.17.1
Attachment: v4-0005-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From a5ef218e1eab55dedcc6061f88eb3fae3b057fdf Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:08:00 +0900
Subject: [PATCH v4 5/5] README for non-volatile WAL buffer
---
README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 184 insertions(+)
create mode 100644 README.nvwal
diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. Putting the WAL buffer pages on persistent memory (PMEM) [1],
+inserting WAL records into it directly, and eliminating I/O for WAL segment
+files, PostgreSQL gets lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+ * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+ * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+ * Linux: 4.15 or later (tested on 5.2)
+ * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+ * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I believe it is good to install under your home directory with --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+ $ ./configure --with-nvwal --prefix="$HOME/postgres"
+ $ make
+ $ make install
+ $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget "-o dax" option
+on mount. For Intel DCPMM and ipmctl, please see [4].
+
+ $ ndctl list
+ [
+ {
+ "dev":"namespace1.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem1",
+ "numa_node":1
+ },
+ {
+ "dev":"namespace0.0",
+ "mode":"raw",
+ "size":103079215104,
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+ ]
+
+ $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+ {
+ "dev":"namespace0.0",
+ "mode":"fsdax",
+ "map":"dev",
+ "size":"94.50 GiB (101.47 GB)",
+ "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+ "sector_size":512,
+ "blockdev":"pmem0",
+ "numa_node":0
+ }
+
+ $ ls -l /dev/pmem0
+ brw-rw---- 1 root disk 259, 3 Jan 6 17:06 /dev/pmem0
+
+ $ sudo mkfs.ext4 -q -F /dev/pmem0
+ $ sudo mkdir -p /mnt/pmem0
+ $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+ $ mount -l | grep ^/dev/pmem0
+ /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge page
+----------------------------
+Transparent huge pages are generally not suitable for database workloads,
+but they improve PMEM performance by reducing page-walk overhead.
+
+ $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+ -rw-r--r-- 1 root root 4096 Dec 3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+ $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+ $ cat /sys/kernel/mm/transparent_hugepage/enabled
+ [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+ -P, --nvwal-path=FILE path to file for non-volatile WAL buffer (NVWAL)
+ -Q, --nvwal-size=SIZE size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+ $ sudo mkdir -p /mnt/pmem0/pgsql
+ $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+ $ export PGDATA="$HOME/pgdata"
+ $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is okay; your NVWAL file holds the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+ segment size. The segment size is given with initdb --wal-segsize, or is
+ 16MB by default.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+ which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+ above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+ exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+ not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please choose the size of
+ your NVWAL file carefully.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters, nvwal_path and nvwal_size, corresponding to the
+two new options of initdb. If you run initdb as above, postgresql.conf in
+your PGDATA directory will contain entries like the following:
+
+ max_wal_size = 80GB
+ min_wal_size = 80GB
+ nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+ nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+ actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+ forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+ nvwal_size. postgres could possibly run even though the three values are
+ not the same; however, we have not tested such a case yet.
+
+
+Startup
+-------
+Same as usual:
+
+ $ pg_ctl start
+
+or use numactl as follows to let postgres run on a specified NUMA node
+(typically the node holding your NVWAL file) if you need stable performance:
+
+ $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
--
2.17.1
Hi Takashi,
Thank you for the patch and work on accelerating PG performance with NVM. I applied the patch and ran some performance tests based on patch v4. I stored database data files on an NVMe SSD and stored the WAL on Intel PMem (NVM). I used two methods to store the WAL file(s):
1. Leverage your patch to access PMem with libpmem (NVWAL patch).
2. Access PMem with legacy filesystem interface, that means use PMem as ordinary block device, no PG patch is required to access PMem (Storage over App Direct).
I tried two insert scenarios:
A. Insert small records (length of record to be inserted is 24 bytes); I think it is similar to your test
B. Insert large record (length of record to be inserted is 328 bytes)
My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. But I observed that the NVWAL patch method had ~5% performance improvement compared with the Storage over App Direct method in scenario A, while it had ~20% performance degradation in scenario B.
I investigated the test further. I found that the NVWAL patch can improve performance of the XLogFlush function, but it may hurt performance of the CopyXlogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM. Here are the key data from my test:
Scenario A (length of record to be inserted: 24 bytes per record):
==================================================================
                                     NVWAL    SoAD
-----------------------------------  -------  -------
Throughput (10^3 TPS)                310.5    296.0
CPU Time % of CopyXlogRecordToWAL    0.4      0.2
CPU Time % of XLogInsertRecord       1.5      0.8
CPU Time % of XLogFlush              2.1      9.6

Scenario B (length of record to be inserted: 328 bytes per record):
===================================================================
                                     NVWAL    SoAD
-----------------------------------  -------  -------
Throughput (10^3 TPS)                13.0     16.9
CPU Time % of CopyXlogRecordToWAL    3.0      1.6
CPU Time % of XLogInsertRecord       23.0     16.4
CPU Time % of XLogFlush              2.3      5.9
Best Regards,
Gang
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer
Rebased.
On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
Dear hackers,
I update my non-volatile WAL buffer's patchset to v3. Now we can use it in streaming replication mode.
Updates from v2:
- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.
- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp>>
NTT Software Innovation Center
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp>>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org<mailto:pgsql-hackers@postgresql.org>>
Cc: 'Robert Haas' <robertmhaas@gmail.com<mailto:robertmhaas@gmail.com>>; 'Heikki Linnakangas' <hlinnaka@iki.fi<mailto:hlinnaka@iki.fi>>; 'Amit Langote'
<amitlangote09@gmail.com<mailto:amitlangote09@gmail.com>>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
I also measured performance before and after patchset, varying -c/--client and -j/--jobs options of pgbench, for
each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts.
Conditions, steps, and other details will be shown later.

Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)

Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)

Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).

The percentage in the s=1000 case looks larger than in the s=50 case. I think a larger scaling factor leads to fewer contentions on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times then I found the median of the three as a final result shown
in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, to use the built-in "TPC-B (sort-of)" query.
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp>>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com<mailto:amitlangote09@gmail.com>>
Cc: 'Robert Haas' <robertmhaas@gmail.com<mailto:robertmhaas@gmail.com>>; 'Heikki Linnakangas' <hlinnaka@iki.fi<mailto:hlinnaka@iki.fi>>;'PostgreSQL-development'
<pgsql-hackers@postgresql.org<mailto:pgsql-hackers@postgresql.org>>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,
Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.
Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com<mailto:amitlangote09@gmail.com>>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp>>
Cc: Robert Haas <robertmhaas@gmail.com<mailto:robertmhaas@gmail.com>>; Heikki Linnakangas
<hlinnaka@iki.fi<mailto:hlinnaka@iki.fi>>; PostgreSQL-development
<pgsql-hackers@postgresql.org<mailto:pgsql-hackers@postgresql.org>>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,
On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp>> wrote:
Hello Amit,
I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as
REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?
Yes, because I think it's human-friendly to reproduce and discuss
performance measurement. Of course I know all new accepted patches are merged into master's HEAD, not stable
branches and not even release tags, so I'm aware of rebasing my
patchset onto master sooner or later. However, if someone,
including me, says that s/he applies my patchset to "master" and
measures its performance, we have to pay attention to which commit the "master"
really points to. Although we have sha1 hashes to specify which
commit, we should check whether the specific commit on master has
patches affecting performance or not, because master's HEAD gets new patches day by day. On the other hand,
a release tag clearly points the commit all we probably know. Also we
can check more easily the features and improvements by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.
While I do sometimes see people reporting numbers with the latest
stable release' branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to notice impact from any relevant developments in the master branch,
even developments which possibly require rethinking the architecture of your own changes, although maybe that rarely occurs.
Thanks,
Amit
--
Takashi Menjo <takashi.menjo@gmail.com<mailto:takashi.menjo@gmail.com>>
Hello Gang,
Thank you for your report. I have not taken care of record size deeply yet,
so your report is very interesting. I will also have a test like yours then
post results here.
Regards,
Takashi
On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.deng@intel.com> wrote:
Hi Takashi,
Thank you for the patch and work on accelerating PG performance with NVM.
I applied the patch and made some performance test based on the patch v4. I
stored database data files on NVMe SSD and stored WAL file on Intel PMem
(NVM). I used two methods to store WAL file(s):1. Leverage your patch to access PMem with libpmem (NVWAL patch).
2. Access PMem with legacy filesystem interface, that means use PMem
as ordinary block device, no PG patch is required to access PMem (Storage
over App Direct).I tried two insert scenarios:
A. Insert small record (length of record to be inserted is 24 bytes),
I think it is similar as your testB. Insert large record (length of record to be inserted is 328 bytes)
My original purpose is to see higher performance gain in scenario B as it
is more write intensive on WAL. But I observed that NVWAL patch method had
~5% performance improvement compared with Storage over App Direct method in
scenario A, while had ~20% performance degradation in scenario B.I made further investigation on the test. I found that NVWAL patch can
improve performance of XlogFlush function, but it may impact performance of
CopyXlogRecordToWAL function. It may be related to the higher latency of
memcpy to Intel PMem comparing with DRAM. Here are key data in my test:Scenario A (length of record to be inserted: 24 bytes per record):
==============================
NVWAL SoAD
------------------------------------
------- -------Througput (10^3 TPS)
310.5 296.0CPU Time % of CopyXlogRecordToWAL
0.4 0.2CPU Time % of XLogInsertRecord
1.5 0.8CPU Time % of XLogFlush
2.1 9.6Scenario B (length of record to be inserted: 328 bytes per record):
==============================
NVWAL SoAD
------------------------------------
------- -------Througput (10^3 TPS)
13.0 16.9CPU Time % of CopyXlogRecordToWAL
3.0 1.6CPU Time % of XLogInsertRecord
23.0 16.4CPU Time % of XLogFlush
2.3 5.9Best Regards,
Gang
*From:* Takashi Menjo <takashi.menjo@gmail.com>
*Sent:* Thursday, September 10, 2020 4:01 PM
*To:* Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
*Cc:* pgsql-hackers@postgresql.org
*Subject:* Re: [PoC] Non-volatile WAL bufferRebased.
2020年6月24日(水) 16:44 Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>:
Dear hackers,
I update my non-volatile WAL buffer's patchset to v3. Now we can use it
in streaming replication mode.Updates from v2:
- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL
buffer if applicable.- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL
buffer if you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with --nvwal-path option. The path
will be written to postgresql.auto.conf or recovery.conf. The size of the
new NVWAL is same as the master's one.Best regards,
Takashi--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote'
<amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL bufferDear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A new v2
patchset is attached to this mail.
I also measured performance before and after patchset, varying
-c/--client and -j/--jobs options of pgbench, for
each scaling factor s = 50 or 1000. The results are presented in the
following tables and the attached charts.
Conditions, steps, and other details will be shown later.
Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)Both throughput and average latency are improved for each scaling
factor. Throughput seemed to almost reach
the upper limit when (c,j)=(36,18).
The percentage in s=1000 case looks larger than in s=50 case. I think
larger scaling factor leads to less
contentions on the same tables and/or indexes, that is, less lock and
unlock operations. In such a situation,
write-ahead logging appears to be more significant for performance.
Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set forpg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access(DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patchSteps
=====
For each (c,j) pair, I did the following steps three times then I foundthe median of the three as a final result shown
in the tables above.
(1) Run initdb with proper -D and -X options; and also give --nvwal-path
and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutespgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j___ dbname
I gave no -b option to use the built-in "TPC-B (sort-of)" query.
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GABest regards,
Takashi--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> NTT Software InnovationCenter
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;
'PostgreSQL-development'
<pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL bufferDear Amit,
Thank you for your advice. Exactly, it's so to speak "do as the
hackers do when in pgsql"...
I'm rebasing my branch onto master. I'll submit an updated patchset
and performance report later.
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,
On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
Hello Amit,
I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as
REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?
Yes, because I think it's human-friendly to reproduce and discuss
performance measurements. Of course I know all new accepted patches are merged into master's HEAD, not stable
branches and not even release tags, so I'm aware of rebasing my
patchset onto master sooner or later. However, if someone,
including me, says that s/he applies my patchset to "master" and
measures its performance, we have to pay attention to which commit the "master"
really points to. Although we have sha1 hashes to specify which
commit, we should check whether the specific commit on master has
patches affecting performance or not, because master's HEAD gets new patches day by day. On the other hand,
a release tag clearly points to a commit we all probably know. Also, we
can more easily check the features and improvements by using release notes and user manuals.
Thanks for clarifying. I see where you're coming from.
While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to notice the impact of relevant developments in the master branch,
even developments which could require rethinking the architecture of your own changes, although maybe that
rarely occurs.
Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com>
Hi Gang,
I have tried to, but cannot yet, reproduce the performance degradation you reported when inserting 328-byte records. So I think your conditions and mine differ, such as in the steps to reproduce, postgresql.conf, installation setup, and so on.
My results and conditions are as follows. May I have your conditions in more detail? Note that I refer to your "Storage over App Direct" as my "Original (PMEM)" and to your "NVWAL patch" as my "Non-volatile WAL buffer."
Best regards,
Takashi
# Results
See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM).
# Steps
Note that I ran the postgres server and pgbench on a single machine but on two separate NUMA nodes. The PMEM and PCIe SSD used by the server process are on the server-side NUMA node.
01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change the "filler" column of the "pgbench_history" table to 300 characters (ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
15) Run pg_prewarm for all four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions
I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections establishing)" of the three as throughput, and the "latency average = __ ms" of that run as average latency.
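As a small illustration of pulling those two figures out of a saved pgbench log, here is a hedged sketch. The log content embedded below is a fabricated sample for the sake of the example, not output from these runs:

```shell
# Extract "tps = __ (including connections establishing)" and
# "latency average = __ ms" from a pgbench log. The sample log written
# here is illustrative only, not a measured result.
cat > /tmp/pgbench_run.log <<'EOF'
latency average = 0.216 ms
tps = 37123.4 (including connections establishing)
tps = 37120.1 (excluding connections establishing)
EOF

# $3 is the numeric tps field; $4 is the latency value in ms
tps=$(awk '/including connections establishing/ {print $3}' /tmp/pgbench_run.log)
lat=$(awk '/latency average/ {print $4}' /tmp/pgbench_run.log)
echo "throughput_tps=$tps latency_ms=$lat"
```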
# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata
# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9
- PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
- PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + non-volatile WAL buffer patchset v4
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 24, 2020 2:38 AM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello Gang,
Thank you for your report. I have not looked into record size deeply yet, so your report is very interesting. I will also run a test like yours and then post the results here.

Regards,
Takashi

On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.deng@intel.com> wrote:
Hi Takashi,
Thank you for the patch and the work on accelerating PG performance with NVM. I applied the patch and ran some performance tests based on patch v4. I stored the database data files on an NVMe SSD and the WAL on Intel PMem (NVM). I used two methods to store the WAL file(s):
1. Leverage your patch to access PMem with libpmem (NVWAL patch).
2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no PG patch is required to access PMem (Storage over App Direct).

I tried two insert scenarios:
A. Insert small record (length of record to be inserted is 24 bytes); I think it is similar to your test
B. Insert large record (length of record to be inserted is 328 bytes)
My original expectation was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. But I observed that the NVWAL patch method had a ~5% performance improvement compared with the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in scenario B.

I investigated further. I found that the NVWAL patch can improve the performance of the XLogFlush function, but it may hurt the performance of the CopyXlogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM. Here are the key data from my test:

Scenario A (length of record to be inserted: 24 bytes per record):
==============================
                                      NVWAL    SoAD
------------------------------------  -------  -------
Throughput (10^3 TPS)                 310.5    296.0
CPU Time % of CopyXlogRecordToWAL       0.4      0.2
CPU Time % of XLogInsertRecord          1.5      0.8
CPU Time % of XLogFlush                 2.1      9.6
Scenario B (length of record to be inserted: 328 bytes per record):
==============================
                                      NVWAL    SoAD
------------------------------------  -------  -------
Throughput (10^3 TPS)                  13.0     16.9
CPU Time % of CopyXlogRecordToWAL       3.0      1.6
CPU Time % of XLogInsertRecord         23.0     16.4
CPU Time % of XLogFlush                 2.3      5.9
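For what it's worth, the relative throughput differences can be recomputed from the raw numbers above; this quick check is an editorial addition, not part of the original report:

```shell
# Recompute NVWAL-vs-SoAD throughput deltas from the tables above.
# (a - b) / b is the relative change of NVWAL (a) versus SoAD (b).
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (a - b) / b * 100 }'; }
ratio 310.5 296.0   # scenario A: NVWAL vs SoAD
ratio 13.0 16.9     # scenario B: NVWAL vs SoAD
```

Scenario A comes out to about +4.9%, matching the ~5% improvement quoted; scenario B computes to about -23% relative to SoAD, in the same ballpark as the ~20% degradation mentioned (the exact figure depends on which side is taken as the baseline).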
Best Regards,
Gang
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.

On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Dear hackers,
I have updated my non-volatile WAL buffer patchset to v3. Now it can be used in streaming replication mode.

Updates from v2:
- walreceiver supports non-volatile WAL buffer
  Now walreceiver stores received records directly into the non-volatile WAL buffer if applicable.
- pg_basebackup supports non-volatile WAL buffer
  Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer if you run it in "nvwal" mode (-Fn).
  You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.

Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote' <amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.
Hi Takashi,
There are some differences between our HW/SW configuration and test steps. I have attached the postgresql.conf I used for your reference. I will try the postgresql.conf and steps you provided in the coming days to see if I can find the cause.
I also ran pgbench and the postgres server on the same machine but on different NUMA nodes, and ensured the server process and PMEM were on the same NUMA node. I used steps similar to yours from step 1 to 9, but with some differences in the later steps; the major ones are:
In step 10), I created a database and table for test by:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text default '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"
In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to warm up.
In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _ insert_bench (test.sql can be found in the attachment).
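The actual test.sql is in the mail's attachment and is not reproduced here. A plausible stand-in, given the table definition above, would be a single insert that relies on the default-filled "info" column; the file path and the commented-out client counts below are illustrative assumptions, not values from the report:

```shell
# Hypothetical stand-in for the attached test.sql -- the real script is
# only in the original mail's attachment. It assumes the "test" table
# created in step 10, whose "info" column is filled by its default.
cat > /tmp/test.sql <<'EOF'
INSERT INTO test (crt_time) VALUES (now());
EOF

# Then, as in the command line quoted above (client/job counts are
# placeholders):
#   pgbench -M prepared -n -r -P 10 -f /tmp/test.sql -T 600 -c 16 -j 16 insert_bench
cat /tmp/test.sql
```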
For HW/SW conf, the major differences are:
CPU: I used Xeon 8268 (24 cores @ 2.9 GHz, HT enabled)
OS Distro: CentOS 8.2.2004
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1
Best regards
Gang
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; 'Takashi Menjo' <takashi.menjo@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer
-----Original Message-----
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 24, 2020 2:38 AM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp>
Subject: Re: [PoC] Non-volatile WAL bufferHello Gang,
Thank you for your report. I have not taken care of record size deeply
yet, so your report is very interesting. I will also have a test like yours then post results here.Regards,
Takashi2020年9月21日(月) 14:14 Deng, Gang <gang.deng@intel.com <mailto:gang.deng@intel.com> >:
Hi Takashi,
Thank you for the patch and work on accelerating PG performance with
NVM. I applied the patch and made some performance test based on the
patch v4. I stored database data files on NVMe SSD and stored WAL file on Intel PMem (NVM). I used two methods to store WAL file(s):1. Leverage your patch to access PMem with libpmem (NVWAL patch).
2. Access PMem with legacy filesystem interface, that means use PMem as ordinary block device, no
PG patch is required to access PMem (Storage over App Direct).I tried two insert scenarios:
A. Insert small record (length of record to be inserted is 24 bytes), I think it is similar as your test
B. Insert large record (length of record to be inserted is 328 bytes)
My original purpose is to see higher performance gain in scenario B as it is more write intensive on WAL.
But I observed that NVWAL patch method had ~5% performance improvement
compared with Storage over App Direct method in scenario A, while had ~20% performance degradation in scenario B.I made further investigation on the test. I found that NVWAL patch
can improve performance of XlogFlush function, but it may impact
performance of CopyXlogRecordToWAL function. It may be related to the higher latency of memcpy to Intel PMem comparing with DRAM. Here are key data in my test:Scenario A (length of record to be inserted: 24 bytes per record):
==============================
NVWAL SoAD
------------------------------------ ------- -------
Througput (10^3 TPS) 310.5
296.0CPU Time % of CopyXlogRecordToWAL 0.4 0.2
CPU Time % of XLogInsertRecord 1.5 0.8
CPU Time % of XLogFlush 2.1 9.6
Scenario B (length of record to be inserted: 328 bytes per record):
==============================
NVWAL SoAD
------------------------------------ ------- -------
Througput (10^3 TPS) 13.0
16.9CPU Time % of CopyXlogRecordToWAL 3.0 1.6
CPU Time % of XLogInsertRecord 23.0 16.4
CPU Time % of XLogFlush 2.3 5.9
Best Regards,
Gang
From: Takashi Menjo <takashi.menjo@gmail.com <mailto:takashi.menjo@gmail.com> >
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp <mailto:takashi.menjou.vg@hco.ntt.co.jp> >
Cc: pgsql-hackers@postgresql.org <mailto:pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL bufferRebased.
2020年6月24日(水) 16:44 Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp
<mailto:takashi.menjou.vg@hco.ntt.co.jp> >:Dear hackers,
I update my non-volatile WAL buffer's patchset to v3. Now we can
use it in streaming replication mode.Updates from v2:
- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL
buffer if you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with --nvwal-path option. The
path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is same as the master's one.Best regards,
Takashi--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp <mailto:takashi.menjou.vg@hco.ntt.co.jp> >
NTT Software Innovation Center-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp> >
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org<mailto:pgsql-hackers@postgresql.org> >
Cc: 'Robert Haas' <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com> >; 'Heikki Linnakangas' <hlinnaka@iki.fi <mailto:hlinnaka@iki.fi> >; 'Amit Langote'
<amitlangote09@gmail.com <mailto:amitlangote09@gmail.com> >
Subject: RE: [PoC] Non-volatile WAL bufferDear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A
new v2 patchset is attached to this mail.
I also measured performance before and after patchset, varying
-c/--client and -j/--jobs options of pgbench, for
each scaling factor s = 50 or 1000. The results are presented in
the following tables and the attached charts.
Conditions, steps, and other details will be shown later.
Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)Both throughput and average latency are improved for each scaling
factor. Throughput seemed to almost reach
the upper limit when (c,j)=(36,18).
The percentage in s=1000 case looks larger than in s=50 case. I
think larger scaling factor leads to less
contentions on the same tables and/or indexes, that is, less lock
and unlock operations. In such a situation,
write-ahead logging appears to be more significant for performance.
Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patchSteps
=====
For each (c,j) pair, I did the following steps three times then Ifound the median of the three as a final result shown
in the tables above.
(1) Run initdb with proper -D and -X options; and also give
--nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutespgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbnameI gave no -b option to use the built-in "TPC-B (sort-of)" query.
Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GABest regards,
Takashi--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp> > NTT Software Innovation Center
-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp> >
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com <mailto:amitlangote09@gmail.com> >
Cc: 'Robert Haas' <robertmhaas@gmail.com<mailto:robertmhaas@gmail.com> >; 'Heikki Linnakangas' <hlinnaka@iki.fi <mailto:hlinnaka@iki.fi> >;
'PostgreSQL-development'
<pgsql-hackers@postgresql.org <mailto:pgsql-hackers@postgresql.org> >
Subject: RE: [PoC] Non-volatile WAL bufferDear Amit,
Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...
I'm rebasing my branch onto master. I'll submit an updated
patchset and performance report later.
Best regards,
Takashi--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp>
NTT Software
Innovation Center
-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com <mailto:amitlangote09@gmail.com> >
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp> >
Cc: Robert Haas <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com> >; Heikki Linnakangas
<hlinnaka@iki.fi <mailto:hlinnaka@iki.fi> >; PostgreSQL-development
<pgsql-hackers@postgresql.org <mailto:pgsql-hackers@postgresql.org> >
Subject: Re: [PoC] Non-volatile WAL bufferHello,
On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp <mailto:takashi.menjou.vg@hco.ntt.co.jp> > wrote:
Hello Amit,
I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as
REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?
Yes, because I think it's human-friendly to reproduce and discuss
performance measurements. Of course I know all new accepted patches are
merged into master's HEAD, not stable branches and not even release
tags, so I'm aware of rebasing my patchset onto master sooner or
later. However, if someone, including me, says that s/he applies my
patchset to "master" and measures its performance, we have to pay
attention to which commit the "master" really points to. Although we
have sha1 hashes to specify which commit, we should check whether the
specific commit on master has patches affecting performance or not,
because master's HEAD gets new patches day by day. On the other hand,
a release tag clearly points to a commit we all probably know. Also we
can check the features and improvements more easily by using release
notes and user manuals.
Thanks for clarifying. I see where you're coming from.
While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to
notice impact from relevant developments in the master branch,
even developments which possibly require rethinking the
architecture of your own changes, although maybe that
rarely occurs.
Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com <mailto:takashi.menjo@gmail.com> >

--
Takashi Menjo <takashi.menjo@gmail.com <mailto:takashi.menjo@gmail.com> >
Hi Gang,
Thanks. I have tried to reproduce the performance degradation using your configuration, query, and steps. Today I got results in which Original (PMEM) achieved better performance than Non-volatile WAL buffer on my Ubuntu environment as well. I am now investigating further.
Best regards,
Takashi
--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center
-----Original Message-----
From: Deng, Gang <gang.deng@intel.com>
Sent: Friday, October 9, 2020 3:10 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org; 'Takashi Menjo' <takashi.menjo@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Takashi,
There are some differences between our HW/SW configuration and test steps. I attached the postgresql.conf I used
for your reference. I would like to try the postgresql.conf and steps you provided in the coming days to see if I can find
the cause.

I also ran pgbench and the postgres server on the same server but on different NUMA nodes, and ensured that the server process
and PMEM were on the same NUMA node. I used steps similar to yours from step 1 to 9, but with some differences in the later
steps, the major ones being:

In step 10), I created a database and table for the test by:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text default
'75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc
48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1
d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d7
9a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to warm up.
In step 16), I ran pgbench using the command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _
insert_bench. (test.sql can be found in the attachment)

For the HW/SW conf, the major differences are:
CPU: I used Xeon 8268 (24c@2.9GHz, HT enabled)
OS Distro: CentOS 8.2.2004
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards
Gang

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; 'Takashi Menjo' <takashi.menjo@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Gang,
I have tried but cannot yet reproduce the performance degradation you reported when inserting 328-byte records. So
I think your conditions and mine differ in some way, such as the steps to reproduce, postgresql.conf, installation
setup, and so on.

My results and conditions are as follows. May I have your conditions in more detail? Note that I refer to your "Storage
over App Direct" as my "Original (PMEM)" and to your "NVWAL patch" as "Non-volatile WAL buffer."

Best regards,
Takashi

# Results
See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single-machine system, but on two separate NUMA nodes. The PMEM
and PCIe SSD for the server process are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo
mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount
/dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change # characters of "filler" column of "pgbench_history" table to 300 (ALTER TABLE pgbench_history
ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
15) Run pg_prewarm for all the four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __
-j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections
establishing)" of the three as the throughput, and the "latency average = __ ms" of that run as the average latency.

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel
x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9
- PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
- PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + non-volatile WAL buffer patchset
v4

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 24, 2020 2:38 AM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello Gang,
Thank you for your report. I have not taken care of record size deeply
yet, so your report is very interesting. I will also run a test like yours and then post the results here.

Regards,
Takashi

On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.deng@intel.com <mailto:gang.deng@intel.com> > wrote:
Hi Takashi,
Thank you for the patch and work on accelerating PG performance with
NVM. I applied the patch and ran some performance tests based on the
patch v4. I stored the database data files on an NVMe SSD and the WAL files on Intel PMem (NVM). I used two methods to store the WAL file(s):
1. Leverage your patch to access PMem with libpmem (NVWAL patch).
2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no
PG patch is required to access PMem (Storage over App Direct).

I tried two insert scenarios:
A. Insert small records (the length of the record to be inserted is 24 bytes); I think this is similar to your test
B. Insert large records (the length of the record to be inserted is 328 bytes)
My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL.
But I observed that the NVWAL patch method had a ~5% performance improvement
compared with the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in
scenario B.
I investigated the test further. I found that the NVWAL patch
can improve the performance of the XLogFlush function, but it may hurt the
performance of the CopyXlogRecordToWAL function. It may be related to the higher latency of memcpy to Intel
PMem compared with DRAM. Here are the key data from my test:
Scenario A (length of record to be inserted: 24 bytes per record):
==================================================================
                                      NVWAL    SoAD
------------------------------------  -------  -------
Throughput (10^3 TPS)                  310.5    296.0
CPU Time % of CopyXlogRecordToWAL        0.4      0.2
CPU Time % of XLogInsertRecord           1.5      0.8
CPU Time % of XLogFlush                  2.1      9.6
Scenario B (length of record to be inserted: 328 bytes per record):
===================================================================
                                      NVWAL    SoAD
------------------------------------  -------  -------
Throughput (10^3 TPS)                   13.0     16.9
CPU Time % of CopyXlogRecordToWAL        3.0      1.6
CPU Time % of XLogInsertRecord          23.0     16.4
CPU Time % of XLogFlush                  2.3      5.9
Best Regards,
Gang
From: Takashi Menjo <takashi.menjo@gmail.com <mailto:takashi.menjo@gmail.com> >
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp <mailto:takashi.menjou.vg@hco.ntt.co.jp> >
Cc: pgsql-hackers@postgresql.org <mailto:pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.

On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp
<mailto:takashi.menjou.vg@hco.ntt.co.jp> > wrote:

Dear hackers,
I update my non-volatile WAL buffer's patchset to v3. Now we can
use it in streaming replication mode.

Updates from v2:
- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly into the non-volatile WAL buffer if applicable.
- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto the non-volatile WAL
buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The
path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.
Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp <mailto:takashi.menjou.vg@hco.ntt.co.jp> >
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp<mailto:takashi.menjou.vg@hco.ntt.co.jp> >
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org<mailto:pgsql-hackers@postgresql.org> >
Cc: 'Robert Haas' <robertmhaas@gmail.com
<mailto:robertmhaas@gmail.com> >; 'Heikki Linnakangas' <hlinnaka@iki.fi <mailto:hlinnaka@iki.fi> >; 'Amit
Langote'
<amitlangote09@gmail.com <mailto:amitlangote09@gmail.com> >
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,
I rebased my non-volatile WAL buffer's patchset onto master. A
new v2 patchset is attached to this mail.
I also measured performance before and after applying the patchset,
varying the -c/--client and -j/--jobs options of pgbench, for
each scaling factor s = 50 or 1000. The results are presented in
the following tables and the attached charts.
Conditions, steps, and other details are shown below.
Results (s=50)
==============
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 35.7 37.1 (+3.9%) 0.224 0.216 (-3.6%)
(18,18) 70.9 74.7 (+5.3%) 0.254 0.241 (-5.1%)
(36,18) 76.0 80.8 (+6.3%) 0.473 0.446 (-5.7%)
(54,18) 75.5 81.8 (+8.3%) 0.715 0.660 (-7.7%)

Results (s=1000)
================
Throughput [10^3 TPS] Average latency [ms]
( c, j) before after before after
------- --------------------- ---------------------
( 8, 8) 37.4 40.1 (+7.3%) 0.214 0.199 (-7.0%)
(18,18) 79.3 86.7 (+9.3%) 0.227 0.208 (-8.4%)
(36,18) 87.2 95.5 (+9.5%) 0.413 0.377 (-8.7%)
(54,18) 86.8 94.8 (+9.3%) 0.622 0.569 (-8.5%)

Both throughput and average latency are improved for each scaling
factor. Throughput seemed to almost reach
the upper limit when (c,j)=(36,18).
The percentages in the s=1000 case look larger than in the s=50 case. I
think a larger scaling factor leads to less
contention on the same tables and/or indexes, that is, fewer lock
and unlock operations. In such a situation,
write-ahead logging appears to be more significant for performance.
Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result shown
in the tables above.
(1) Run initdb with proper -D and -X options; and also give
--nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option to use the built-in "TPC-B (sort-of)" query.
I had a new look at this thread today, trying to figure out where we
are. I'm a bit confused.
One thing we have established: mmap()ing WAL files performs worse than
the current method, if pg_wal is not on a persistent memory device. This
is because the kernel faults in existing content of each page, even
though we're overwriting everything.
That's unfortunate. I was hoping that mmap() would be a good option even
without persistent memory hardware. I wish we could tell the kernel to
zero the pages instead of reading them from the file. Maybe clear the
file with ftruncate() before mmapping it?
That should not be a problem with a real persistent memory device, however
(or when emulating it with DRAM). With DAX, the storage is memory-mapped
directly and there is no page cache, and no pre-faulting.
Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is stored
on the NVRAM. Why? I realize that this is just a Proof of Concept, but
I'm very much not interested in anything that requires the DBA to manage
a second WAL location. Did you test the mmap() patches with persistent
memory hardware? Did you compare that with the pmem patchset, on the
same hardware? If there's a meaningful performance difference between
the two, what's causing it?
- Heikki
Hi Heikki,
I had a new look at this thread today, trying to figure out where we are.
I'm a bit confused.
One thing we have established: mmap()ing WAL files performs worse than
the current method, if pg_wal is not on
a persistent memory device. This is because the kernel faults in existing
content of each page, even though we're
overwriting everything.
Yes. In addition, after a certain page (in the sense of OS page) is
msync()ed, another page fault will occur again when something is stored
into that page.
That's unfortunate. I was hoping that mmap() would be a good option even
without persistent memory hardware.
I wish we could tell the kernel to zero the pages instead of reading them
from the file. Maybe clear the file with
ftruncate() before mmapping it?
The area extended by ftruncate() appears as if it were zero-filled [1].
Please note that it merely "appears as if." It might not be actually
zero-filled as data blocks on devices, so pre-allocating files should
improve transaction performance. At least, on Linux 5.7 and ext4, it takes
more time to store into a mapped file that was just open(O_CREAT)ed and
ftruncate()d than into one whose blocks have actually been filled already.
That should not be problem with a real persistent memory device, however
(or when emulating it with DRAM). With
DAX, the storage is memory-mapped directly and there is no page cache,
and no pre-faulting.
Yes, with filesystem DAX, there is no page cache for file data. A page
fault still occurs, but only once per 2MiB DAX hugepage, so its overhead
decreases compared with 4KiB page faults. Such a DAX hugepage fault is only
applied to DAX-mapped files and is different from a general transparent
hugepage fault.
Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is stored on
the NVRAM. Why? I realize that this is just
a Proof of Concept, but I'm very much not interested in anything that
requires the DBA to manage a second WAL
location. Did you test the mmap() patches with persistent memory
hardware? Did you compare that with the pmem
patchset, on the same hardware? If there's a meaningful performance
difference between the two, what's causing
it?
Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.
The reason this patchset puts the buffers into a separate file, not the
existing segment files in PGDATA/pg_wal, is that it reduces the overhead
due to system calls such as open(), mmap(), munmap(), and close(). It open()s and
mmap()s the file "nvwal_path" once, and keeps that file mapped while
running. On the other hand, with the patchset that mmap()s the segment
files, a backend process must munmap() and close() the currently mapped
file and open() and mmap() the next one each time the insert location
for that process crosses a segment boundary. This causes the performance difference
between the two.
Best regards,
Takashi
[1]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html
--
Takashi Menjo <takashi.menjo@gmail.com>
Hi Gang,
I appreciate your patience. I reproduced the results you reported to me, on
my environment.
First of all, the configuration you gave me was a little unstable on my
environment, so I made the values of {max_,min_,nv}wal_size larger and the
pre-warm duration longer to get stable performance. I didn't modify your
table, query, or benchmark duration.
Under the stable condition, Original (PMEM) still got better performance
than Non-volatile WAL Buffer. To sum up, the reason was that Non-volatile
WAL Buffer on Optane PMem spent much more time than Original (PMEM) for
XLogInsert when using your table and query. It offset the improvement of
XLogFlush, and degraded performance in total. VTune told me that
Non-volatile WAL Buffer took more CPU time than Original (PMEM) for
(XLogInsert => XLogInsertRecord => CopyXLogRecordsToWAL =>) memcpy while it
took less time for XLogFlush. This profile was very similar to the one you
reported.
In general, when WAL buffers are on Optane PMem rather than DRAM, it is
obvious that it takes more time to memcpy WAL records into the buffers
because Optane PMem is a little slower than DRAM. In return, the
Non-volatile WAL Buffer reduces the time for the records to reach the
device, because it doesn't need to write them out of the buffers to
somewhere else, but only needs to flush them out of the CPU caches to the
underlying memory-mapped file.
Your report shows that Non-volatile WAL Buffer on Optane PMem is not good
for certain kinds of transactions, and is good for others. I have tried to
fix how to insert and flush WAL records, or the configurations or constants
that could change performance, such as NUM_XLOGINSERT_LOCKS, but
Non-volatile WAL Buffer has not achieved better performance than Original
(PMEM) yet when using your table and query. I will continue to work on this
issue and will report if I have any update.
By the way, did the performance progress reported by pgbench with the -P
option drop to zero while you ran Non-volatile WAL Buffer? If so, your
{max_,min_,nv}wal_size might be too small or your checkpoint configuration
might not be appropriate. Could you check your results again?
Best regards,
Takashi
--
Takashi Menjo <takashi.menjo@gmail.com>
Hi,
These patches no longer apply :-( A rebased version would be nice.
I've been interested in what performance improvements this might bring,
so I've been running some extensive benchmarks on a machine with PMEM
hardware. So let me share some interesting results. (I used commit from
early September, to make the patch apply cleanly.)
Note: The hardware was provided by Intel, and they are interested in
supporting the development and providing access to machines with PMEM to
developers. So if you're interested in this patch & PMEM, but don't have
access to suitable hardware, try contacting Steve Shaw
<steve.shaw@intel.com> who's the person responsible for open source
databases at Intel (he's also the author of HammerDB).
The benchmarks were done on a machine with 2 x Xeon Platinum (24/48
cores), 128GB RAM, NVMe and PMEM SSDs. I did some basic pgbench tests
with different scales (500, 5000, 15000) with and without these patches.
I did some usual tuning (shared buffers, max_wal_size etc.), the most
important changes being:
- maintenance_work_mem = 256MB
- max_connections = 200
- random_page_cost = 1.2
- shared_buffers = 16GB
- work_mem = 64MB
- checkpoint_completion_target = 0.9
- checkpoint_timeout = 20min
- max_wal_size = 96GB
- autovacuum_analyze_scale_factor = 0.1
- autovacuum_vacuum_insert_scale_factor = 0.05
- autovacuum_vacuum_scale_factor = 0.01
- vacuum_cost_limit = 1000
And on the patched version:
- nvwal_size = 128GB
- nvwal_path = … points to the PMEM DAX device …
The machine has multiple SSDs (all Optane-based, IIRC):
- NVMe SSD (Optane)
- PMEM in BTT mode
- PMEM in DAX mode
So I've tested all of them - the data was always on the NVMe device, and
the WAL was placed on one of those devices. That means we have these
four cases to compare:
- nvme - master with WAL on the NVMe SSD
- pmembtt - master with WAL on PMEM in BTT mode
- pmemdax - master with WAL on PMEM in DAX mode
- pmemdax-ntt - patched version with WAL on PMEM in DAX mode
The "nvme" is a bit disadvantaged as it places both data and WAL on the
same device, so consider that while evaluating the results. But for the
smaller data sets this should be fairly negligible, I believe.
I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
with WAL on PMEM DAX device) is actually safe, but I included it anyway
to see what difference is.
Now let's look at results for the basic data sizes and client counts.
I've also attached some charts to illustrate this. These numbers are tps
averages from 3 runs, each about 30 minutes long.
1) scale 500 (fits into shared buffers)
---------------------------------------
wal 1 16 32 64 96
----------------------------------------------------------
nvme 6321 73794 132687 185409 192228
pmembtt 6248 60105 85272 82943 84124
pmemdax 6686 86188 154850 105219 149224
pmemdax-ntt 8062 104887 211722 231085 252593
The NVMe performs well (the single device is not an issue, as there
should be very little non-WAL I/O). The PMEM/BTT has a clear bottleneck
at ~85k tps. It's interesting that PMEM/DAX performs much worse without the
patch, and that there's a drop at 64 clients. Not sure what that's about.
2) scale 5000 (fits into RAM)
-----------------------------
wal 1 16 32 64 96
-----------------------------------------------------------
nvme 4804 43636 61443 79807 86414
pmembtt 4203 28354 37562 41562 43684
pmemdax 5580 62180 92361 112935 117261
pmemdax-ntt 6325 79887 128259 141793 127224
The differences are more significant, compared to the small scale. The
BTT seems to have bottleneck around ~43k tps, the PMEM/DAX dominates.
3) scale 15000 (bigger than RAM)
--------------------------------
wal 1 16 32 64 96
-----------------------------------------------------------
pmembtt 3638 20630 28985 32019 31303
pmemdax 5164 48230 69822 85740 90452
pmemdax-ntt 5382 62359 80038 83779 80191
I have not included the nvme results here, because the impact of placing
both data and WAL on the same device was too significant IMHO.
The remaining results seem nice. It's interesting the patched case is a
bit slower than master. Not sure why.
Overall, these results seem pretty nice, I guess. Of course, this does
not say the current patch is the best way to implement this (or whether
it's correct), but it does suggest supporting PMEM might bring sizeable
performance boost.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 10/30/20 6:57 AM, Takashi Menjo wrote:
Hi Heikki,
I had a new look at this thread today, trying to figure out where
we are. I'm a bit confused.
One thing we have established: mmap()ing WAL files performs worse
than the current method, if pg_wal is not on a persistent memory
device. This is because the kernel faults in existing content of
each page, even though we're overwriting everything.

Yes. In addition, after a certain page (in the sense of an OS page) is
msync()ed, another page fault will occur again when something is
stored into that page.

That's unfortunate. I was hoping that mmap() would be a good option
even without persistent memory hardware. I wish we could tell the
kernel to zero the pages instead of reading them from the file.
Maybe clear the file with ftruncate() before mmapping it?

The area extended by ftruncate() appears as if it were zero-filled
[1]. Please note that it merely "appears as if." It might not be
actually zero-filled as data blocks on devices, so pre-allocating
files should improve transaction performance. At least, on Linux 5.7
and ext4, it takes more time to store into the mapped file just
open(O_CREAT)ed and ftruncate()d than into the one filled already and
actually.
Does is really matter that it only appears zero-filled? I think Heikki's
point was that maybe ftruncate() would prevent the kernel from faulting
the existing page content when we're overwriting it.
Not sure I understand what the benchmark with ext4 was doing, exactly.
How was that measured? Might be interesting to have some simple
benchmarking tool to demonstrate this (I believe a small standalone tool
written in C should do the trick).
That should not be problem with a real persistent memory device,
however (or when emulating it with DRAM). With DAX, the storage is
memory-mapped directly and there is no page cache, and no
pre-faulting.

Yes, with filesystem DAX, there is no page cache for file data. A
page fault still occurs but for each 2MiB DAX hugepage, so its
overhead decreases compared with 4KiB page fault. Such a DAX
hugepage fault is only applied to DAX-mapped files and is different
from a general transparent hugepage fault.
I don't follow - if there are page faults even when overwriting all the
data, I'd say it's still an issue even with 2MB DAX pages. How big is
the difference between 4kB and 2MB pages?
Not sure I understand how this is different from a general THP fault.
Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is
stored on the NVRAM. Why? I realize that this is just a Proof of
Concept, but I'm very much not interested in anything that requires
the DBA to manage a second WAL location. Did you test the mmap()
patches with persistent memory hardware? Did you compare that with
the pmem patchset, on the same hardware? If there's a meaningful
performance difference between the two, what's causing it?
Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

The reason this patchset puts the buffers into a separate file,
rather than the existing segment files in PGDATA/pg_wal, is that it
reduces the overhead of system calls such as open(), mmap(),
munmap(), and close(). It open()s and mmap()s the "nvwal_path" file
once and keeps it mapped while running. With the patchset that
mmap()s the segment files, on the other hand, a backend process has
to munmap() and close() the currently mapped file and open() and
mmap() the next one every time its insert location crosses a segment
boundary. That causes the performance difference between the two.
I kinda agree with Heikki here - having to manage yet another location
for WAL data is rather inconvenient. We should aim not to make the life
of DBAs unnecessarily difficult, IMO.
I wonder how significant the syscall overhead is - can you share some
numbers? I don't see any such results in this thread, so I'm not sure if
it means losing 1% or 10% of throughput.
Also, maybe there are alternative ways to reduce the overhead? For
example, we could increase the WAL segment size - with 1GB segments we'd
do 1/64 as many syscalls. Or maybe we could do some of this
asynchronously - request a segment ahead, and let another process do the
actual work, so that the running process does not wait.
Do I understand correctly that the patch removes "regular" WAL buffers
and instead writes the data into the non-volatile PMEM buffer, without
writing that to the WAL segments at all (unless in archiving mode)?
Firstly, I guess many (most?) instances will have to write the WAL
segments anyway because of PITR/backups, so I'm not sure we can save
much here.
But more importantly - doesn't that mean the nvwal_size value is
essentially a hard limit? With max_wal_size, it's a soft limit i.e.
we're allowed to temporarily use more WAL when needed. But with a
pre-allocated file, that's clearly not possible. So what would happen in
those cases?
Also, is it possible to change nvwal_size? I haven't tried, but I wonder
what happens with the current contents of the file.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 11/23/20 3:01 AM, Tomas Vondra wrote:
[snip]
One more thought about this - if ftruncate() is not enough to convince
the mmap() to not load existing data from the file, what about not
reusing the WAL segments at all? I haven't tried, though.
[snip]
I've been thinking about the current design (which essentially places
the WAL buffers on PMEM) a bit more. I wonder whether that's actually
the right design ...
The way I understand the current design is that we're essentially
switching from this architecture:
clients -> wal buffers (DRAM) -> wal segments (storage)
to this
clients -> wal buffers (PMEM)
(Assuming there we don't have to write segments because of archiving.)
The first thing to consider is that PMEM is actually somewhat slower
than DRAM, the difference is roughly 100ns vs. 300ns (see [1] and [2]).
From this POV it's a bit strange that we're moving the WAL buffer to a
slower medium.
Of course, PMEM is significantly faster than other storage types (e.g.
order of magnitude faster than flash) and we're eliminating the need to
write the WAL from PMEM in some cases, and that may help.
The second thing I notice is that PMEM does not seem to handle many
clients particularly well - if you look at Figure 2 in [2], you'll see
that there's a clear drop-off in write bandwidth after only a few
clients. For DRAM there's no such issue. (The total PMEM bandwidth seems
much worse than for DRAM too.)
So I wonder if using PMEM for the WAL buffer is the right way forward.
AFAIK the WAL buffer is quite concurrent (multiple clients writing
data), which seems to contradict the PMEM vs. DRAM trade-offs.
The design I've originally expected would look more like this
clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)
i.e. mostly what we have now, but instead of writing the WAL segments
"the usual way" we'd write them using mmap/memcpy, without fsync.
I suppose that's what Heikki meant too, but I'm not sure.
regards
[1]: https://pmem.io/2019/12/19/performance.html
[2]: https://arxiv.org/pdf/1904.01614.pdf
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
So I wonder if using PMEM for the WAL buffer is the right way forward.
AFAIK the WAL buffer is quite concurrent (multiple clients writing
data), which seems to contradict the PMEM vs. DRAM trade-offs.

The design I've originally expected would look more like this
clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)
i.e. mostly what we have now, but instead of writing the WAL segments
"the usual way" we'd write them using mmap/memcpy, without fsync.

I suppose that's what Heikki meant too, but I'm not sure.
SQL Server probably does so. Please see the following page and the links in its "Next steps" section. I say "probably" because the document doesn't clearly state whether SQL Server memcpys data from the DRAM log cache to the non-volatile log cache only on transaction commit or for all log cache writes. I presume the former.
Add persisted log buffer to a database
https://docs.microsoft.com/en-us/sql/relational-databases/databases/add-persisted-log-buffer?view=sql-server-ver15
--------------------------------------------------
With non-volatile, tail of the log storage the pattern is
memcpy to LC
memcpy to NV LC
Set status
Return control to caller (commit is now valid)
...
With this new functionality, we use a region of memory which is mapped to a file on a DAX volume to hold that buffer. Since the memory hosted by the DAX volume is already persistent, we have no need to perform a separate flush, and can immediately continue with processing the next operation. Data is flushed from this buffer to more traditional storage in the background.
--------------------------------------------------
Regards
Takayuki Tsunakawa
On 11/24/20 7:34 AM, tsunakawa.takay@fujitsu.com wrote:
[snip]
Interesting, thanks for the link. If I understand [1] correctly, they
essentially do this:
clients -> buffers (DRAM) -> buffers (PMEM) -> wal (storage)
that is, they insert the PMEM buffer between the LC (in DRAM) and
traditional (non-PMEM) storage, so that a commit does not need to do any
fsyncs etc.
It seems to imply the memcpy between DRAM and PMEM happens right when
writing the WAL, but I guess that's not strictly required - we might
just as well do that in the background, I think.
It's interesting that they only place the tail of the log on PMEM, i.e.
the PMEM buffer has limited size, and the rest of the log is not on
PMEM. It's a bit as if we inserted a PMEM buffer between our wal buffers
and the WAL segments, and kept the WAL segments on regular storage. That
could work, but I'd bet they did that because at that time the NV
devices were much smaller, and placing the whole log on PMEM was not
quite possible. So it might be unnecessarily complicated, considering
the PMEM device capacity is much higher now.
So I'd suggest we simply try this:
clients -> buffers (DRAM) -> wal segments (PMEM)
I plan to do some hacking and maybe hack together some simple tools to
benchmark various approaches.
regards
[1]: https://docs.microsoft.com/en-us/archive/blogs/bobsql/how-it-works-it-just-runs-faster-non-volatile-memory-sql-server-tail-of-log-caching-on-nvdimm
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
From: Tomas Vondra <tomas.vondra@enterprisedb.com>
It's interesting that they only place the tail of the log on PMEM, i.e.
the PMEM buffer has limited size, and the rest of the log is not on
PMEM. It's a bit as if we inserted a PMEM buffer between our wal buffers
and the WAL segments, and kept the WAL segments on regular storage. That
could work, but I'd bet they did that because at that time the NV
devices were much smaller, and placing the whole log on PMEM was not
quite possible. So it might be unnecessarily complicated, considering
the PMEM device capacity is much higher now.

So I'd suggest we simply try this:
clients -> buffers (DRAM) -> wal segments (PMEM)
I plan to do some hacking and maybe hack together some simple tools to
benchmark various approaches.
I'm in favor of your approach. Yes, Intel PMEM modules were available in 128/256/512 GB capacities when I checked last year. That's more than enough to hold all WAL segments, so a small PMEM WAL buffer is not necessary. I'm excited to see Postgres gain more power.
Regards
Takayuki Tsunakawa
On Sun, Nov 22, 2020 at 5:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:
I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
with WAL on PMEM DAX device) is actually safe, but I included it anyway
to see what difference is.
I am curious to learn more about this aspect. Kernels have provided
support for the "pmemdax" mode, so which part of the stack is unsafe?
Reading the numbers, it seems that only at smaller scales does the
modified PostgreSQL give an enhanced benefit over unmodified PostgreSQL
with "pmemdax". For most other cases the numbers are pretty close between
these two setups, so I'm curious to learn why we should even modify
PostgreSQL if the unmodified version can provide a similar benefit with
just DAX mode.
On 11/25/20 1:27 AM, tsunakawa.takay@fujitsu.com wrote:
[snip]

I'm in favor of your approach. [snip]
Cool. FWIW I'm not 100% sure it's the right approach, but I think it's
worth testing. In the worst case we'll discover that this architecture
does not allow fully leveraging PMEM benefits, or maybe it won't work
for some other reason and the approach proposed here will work better.
Let's play a bit and we'll see.
I have hacked a very simple patch doing this (essentially replacing
open/write/close calls in xlog.c with pmem calls). It's a bit rough but
seems good enough for testing/experimenting. I'll polish it a bit, do
some benchmarks, and share some numbers in a day or two.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 11/25/20 2:10 AM, Ashwin Agrawal wrote:
On Sun, Nov 22, 2020 at 5:23 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
with WAL on PMEM DAX device) is actually safe, but I included it anyway
to see what the difference is.

I am curious to learn more about this aspect. Kernels have provided
support for the "pmemdax" mode, so which part of the stack is unsafe?
I do admit I'm not 100% certain about this, so I err on the side of
caution. While discussing this with Steve Shaw, he suggested that
applications may get broken because DAX devices don't behave like block
devices in some respects (atomicity, addressability, ...).
Reading the numbers, it seems that only at smaller scales does the
modified PostgreSQL give an enhanced benefit over unmodified PostgreSQL
with "pmemdax". For most other cases the numbers are pretty close between
these two setups, so I'm curious to learn why we should even modify
PostgreSQL if the unmodified version can provide a similar benefit with
just DAX mode.
That's a valid question, but I wouldn't say the ~20% difference on the
medium scale is negligible. And it's possible that for the larger scales
the primary bottleneck is the storage used for data directory, not WAL
(notice that nvme is missing for the large scale).
Of course, it's faster than flash storage but the PMEM costs more too,
and when you pay $$$ for hardware you probably want to get as much
benefit from it as possible.
[1]: https://ark.intel.com/content/www/us/en/ark/products/203879/intel-optane-persistent-memory-200-series-128gb-pmem-module.html
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.
The numbers (5-minute pgbench runs on scale 500) look like this:
     master/btt  master/dax     ntt  simple
-----------------------------------------------------------
 1         5469        7402    7977    6746
16        48222       80869  107025   82343
32        73974      158189  214718  158348
64        85921      154540  225715  164248
96       150602      221159  237008  217253
A chart illustrating these results is attached. The four columns are
showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
"ntt" is the patch submitted to this thread, and "simple" is the patch
I've hacked together.
As expected, the BTT case performs poorly (compared to the rest).
The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch does
outperform both by ~20-40% in some cases.
The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.
So I've added some primitive instrumentation to the code, counting the
calls and measuring duration for each of the PMEM operations, and
printing the stats regularly into log (after ~1M ops).
Typical results from a run with a single client look like this (slightly
formatted/wrapped for e-mail):
PMEM STATS
COUNT   total 1000000        map 30          unmap 20
        memcpy 510210        persist 489740
TIME    total 0              map 931080      unmap 188750
        memcpy 4938866752    persist 187846686
LENGTH  memcpy 4337647616    persist 329824672
This shows that a majority of the 1M calls is memcpy/persist, the rest
is mostly negligible - both in terms of number of calls and duration.
The time values are in nanoseconds, BTW.
So for example we did 30 map_file calls, taking ~0.9ms in total, and the
unmap calls took even less time. So the direct impact of map/unmap calls
is rather negligible, I think.
The dominant part is clearly the memcpy (~5s) and persist (~0.2s). It's
not much per call, but overall it costs much more than the map and unmap
calls.
Finally, let's look at the LENGTH, which is a sum of the ranges either
copied to PMEM (memcpy) or fsynced (persist). Those are in bytes, and
the memcpy value is way higher than the persist one. In this particular
case, it's something like 4.3GB vs. 330MB, so an order of magnitude.
It's entirely possible this is a bug/measurement error in the patch. I'm
not all that familiar with the XLOG stuff, so maybe I did some silly
mistake somewhere.
But I think it might be also explained by the fact that XLogWrite()
always writes the WAL in a multiple of 8kB pages. Which is perfectly
reasonable for regular block-oriented storage, but pmem/dax is exactly
about not having to do that - PMEM is byte-addressable. And with pgbench,
the individual WAL records are tiny, so having to instead write/flush
the whole 8kB page (or more of them) repeatedly, as we append the WAL
records, seems a bit wasteful. So I wonder if this is why the trivial
patch does not show any benefits.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachments:
patches.png (chart of the benchmark results)