[PoC] Non-volatile WAL buffer

Started by Takashi Menjo almost 6 years ago, 71 messages
#1 Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
3 attachment(s)

Dear hackers,

I propose a proof-of-concept feature, the "non-volatile WAL buffer." It
places the WAL buffer on persistent memory (PMEM) instead of DRAM, so WAL
records become durable without being written out to WAL segment files.
This improves database performance by reducing the number of WAL copies
and shortening write transactions.

I attach the first patchset, which applies to PostgreSQL 12.0
(refs/tags/REL_12_0). Please see README.nvwal (added by patch 0003) for
how to use the new feature.

PMEM [1] is fast, non-volatile, byte-addressable memory installed in
DIMM slots. Such products are already available. For example, an
NVDIMM-N is a type of PMEM module that contains both DRAM and NAND flash.
It can be accessed like regular DRAM, but on power loss it saves its
contents to the flash area. On power restore it does the reverse, that
is, the contents are copied back into DRAM. PMEM is also already
supported by major operating systems such as Linux and Windows, and by new
open-source libraries such as the Persistent Memory Development Kit
(PMDK) [2]. Furthermore, several DBMSes have started to support PMEM.

It's time for PostgreSQL. PMEM is faster than a solid state disk and
can naively be used as block storage. However, we cannot gain much
performance that way, because PMEM is so fast that the overhead of the
traditional software stack (user buffers, filesystems, and block layers)
is no longer negligible. The non-volatile WAL buffer is an effort to make
PostgreSQL PMEM-aware, that is, to access PMEM directly as RAM, bypassing
that overhead and achieving the maximum possible benefit. I believe WAL
is one of the most important modules to redesign for PMEM, because it was
designed for slow disks such as HDDs and SSDs, and PMEM is not slow.
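
To illustrate the direct-access idea, here is a minimal sketch of the
store-then-flush pattern. It is not taken from the patchset. It uses POSIX
mmap() and msync() as stand-ins for libpmem's pmem_map_file() and
pmem_persist() so that it runs without PMEM hardware, and DemoLog,
demo_log_open, demo_log_append, and demo_log_close are hypothetical names
introduced only for this example.

```c
/*
 * Sketch only: POSIX mmap()/msync() stand in for libpmem's
 * pmem_map_file()/pmem_persist().  All Demo* names are hypothetical.
 */
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

typedef struct DemoLog
{
	char	   *base;			/* start of the mapped region */
	size_t		size;			/* length of the mapping in bytes */
	size_t		tail;			/* next free byte offset */
} DemoLog;

/* Create (or open) the file, size it, and map it shared.  0 on success. */
static int
demo_log_open(DemoLog *log, const char *path, size_t size)
{
	int			fd;
	void	   *addr;

	assert(log != NULL && path != NULL && size > 0);

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) size) != 0)
	{
		close(fd);
		return -1;
	}
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close */
	if (addr == MAP_FAILED)
		return -1;

	log->base = addr;
	log->size = size;
	log->tail = 0;
	return 0;
}

/*
 * Copy one record straight into the mapping and flush it.  There is no
 * intermediate DRAM buffer and no write() call.  Returns the record's
 * start offset, or -1 if it does not fit or the flush fails.
 */
static long
demo_log_append(DemoLog *log, const void *rec, size_t len)
{
	size_t		start = log->tail;
	size_t		page = (size_t) sysconf(_SC_PAGESIZE);
	size_t		sync_from;

	if (len > log->size - start)
		return -1;

	memcpy(log->base + start, rec, len);

	/* msync() needs a page-aligned address, so round the start down. */
	sync_from = start - (start % page);
	if (msync(log->base + sync_from, start + len - sync_from, MS_SYNC) != 0)
		return -1;

	log->tail = start + len;
	return (long) start;
}

/* Unmap the region.  Returns 0 on success. */
static int
demo_log_close(DemoLog *log)
{
	return munmap(log->base, log->size);
}
```

On real PMEM the msync() call would be replaced by CPU cache flush
instructions issued from user space, which is where the saving over the
block-storage path comes from.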

This work is inspired by "Non-volatile Memory Logging," presented at PGCon
2016 [3], and aims to gain more benefit from PMEM than my and Yoshimi's
previous work did [4][5]. I have submitted a talk proposal for PGCon this
year, and have measured and analyzed the performance of my PostgreSQL with
non-volatile WAL buffer, comparing it with the original one that uses PMEM
as "a faster-than-SSD storage." I will talk about the results if accepted.

Best regards,
Takashi Menjo

[1]: Persistent Memory (SNIA) https://www.snia.org/PM
[2]: Persistent Memory Development Kit (pmem.io) https://pmem.io/pmdk/
[3]: Non-volatile Memory Logging (PGCon 2016) https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[4]: Introducing PMDK into PostgreSQL (PGCon 2018) https://www.pgcon.org/2018/schedule/events/1154.en.html
[5]: Applying PMDK to WAL operations for persistent memory (pgsql-hackers) /messages/by-id/C20D38E97BCB33DAD59E3A1@lab.ntt.co.jp

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Attachments:

0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From 02896517f42d60e8f436ec5d0ab1a55b0ce1a3f9 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Fri, 24 Jan 2020 13:16:26 +0900
Subject: [PATCH 1/3] Support GUCs for external WAL buffer

To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size.  Now postgres maps a file at that path onto memory to
use it as WAL buffer.  Note that the buffer is still volatile for now.
---
 configure                                     |  99 +++++++++++
 configure.in                                  |  19 ++
 src/backend/access/transam/Makefile           |   2 +-
 src/backend/access/transam/nv_xlog_buffer.c   |  95 ++++++++++
 src/backend/access/transam/xlog.c             | 164 ++++++++++++++++--
 src/backend/utils/misc/guc.c                  |  23 ++-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/bin/initdb/initdb.c                       |  95 +++++++++-
 src/include/access/nv_xlog_buffer.h           |  71 ++++++++
 src/include/access/xlog.h                     |   2 +
 src/include/pg_config.h.in                    |   6 +
 src/include/utils/guc.h                       |   4 +
 12 files changed, 560 insertions(+), 22 deletions(-)
 create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
 create mode 100644 src/include/access/nv_xlog_buffer.h

diff --git a/configure b/configure
index 54c852aca5..4674419094 100755
--- a/configure
+++ b/configure
@@ -864,6 +864,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 enable_float4_byval
@@ -1570,6 +1571,7 @@ Optional Packages:
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
+  --with-nvwal            use non-volatile WAL buffer (NVWAL)
   --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
 
 Some influential environment variables:
@@ -8306,6 +8308,40 @@ fi
 
 
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
 #
 # Elf
 #
@@ -12694,6 +12730,57 @@ fi
 fi
 
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
 
 ##
 ## Header files
@@ -13373,6 +13460,18 @@ fi
 
 done
 
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$PORTNAME" = "win32" ; then
diff --git a/configure.in b/configure.in
index 6942f81d1e..d2062d020a 100644
--- a/configure.in
+++ b/configure.in
@@ -964,6 +964,14 @@ PGAC_ARG_BOOL(with, zlib, yes,
               [do not use Zlib])
 AC_SUBST(with_zlib)
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+              [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
 #
 # Elf
 #
@@ -1287,6 +1295,12 @@ elif test "$with_uuid" = ossp ; then
 fi
 AC_SUBST(UUID_LIBS)
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [],
+               [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 
 ##
 ## Header files
@@ -1467,6 +1481,11 @@ elif test "$with_uuid" = ossp ; then
       [AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
 fi
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 if test "$PORTNAME" = "win32" ; then
    AC_CHECK_HEADERS(crtdefs.h)
 fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 16fbe47269..addeae9477 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -15,7 +15,7 @@ include $(top_builddir)/src/Makefile.global
 OBJS = clog.o commit_ts.o generic_xlog.o multixact.o parallel.o rmgr.o slru.o \
 	subtrans.o timeline.o transam.o twophase.o twophase_rmgr.o varsup.o \
 	xact.o xlog.o xlogarchive.o xlogfuncs.o \
-	xloginsert.o xlogreader.o xlogutils.o
+	xloginsert.o xlogreader.o xlogutils.o nv_xlog_buffer.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ *		PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns the mapped address on success; PANICs and never returns otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
+
+	Assert(fname != NULL);
+	Assert(fsize > 0);
+
+	if (IsBootstrapProcessingMode())
+	{
+		/*
+		 * Create and map a new file if we are in bootstrap mode (typically
+		 * executed by initdb).
+		 */
+		addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+							 pg_file_create_mode, &map_len, &is_pmem);
+	}
+	else
+	{
+		/*
+		 * Map an existing file.  The second argument (len) should be zero,
+		 * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+		 * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+		 */
+		addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+	}
+
+	if (addr == NULL)
+		elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+	if (map_len != fsize)
+		elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+					"expected %zu; actual %zu",
+			 fname, fsize, map_len);
+
+	if (!is_pmem)
+		elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+			 fname);
+
+	/*
+	 * Assert page boundary alignment (8KiB as default).  It should pass because
+	 * PMDK considers hugepage boundary alignment (2MiB or 1GiB on x64).
+	 */
+	Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+	elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+		 fname, addr, (char *) addr + map_len);
+	return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	Assert(addr != NULL);
+
+	if (pmem_unmap(addr, fsize) < 0)
+	{
+		elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+		return;
+	}
+
+	elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 77ad765989..eae0c01e3c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -848,6 +849,12 @@ static bool InRedo = false;
 /* Have we launched bgwriter during recovery? */
 static bool bgwriterLaunched = false;
 
+/* For non-volatile WAL buffer (NVWAL) */
+char	   *NvwalPath = NULL;	/* a GUC parameter */
+int			NvwalSizeMB = 1024;	/* a direct GUC parameter */
+static Size	NvwalSize = 0;		/* an indirect GUC parameter */
+static bool	NvwalAvail = false;
+
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -4906,6 +4913,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+	Assert(!NvwalAvail);
+
+	if (**newval != '\0')
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path is invalid parameter without NVWAL");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+	/* true if not empty; false if empty */
+	NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the boundary only and DOES NOT check if the size is a multiple
+ * of wal_segment_size because the segment size (probably stored in the
+ * control file) has not been set properly here yet.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+	Size		buf_size;
+	int64		npages;
+
+	Assert(*newval > 0);
+
+	buf_size = (Size) (*newval) * 1024 * 1024;
+	npages = (int64) buf_size / XLOG_BLCKSZ;
+	Assert(npages > 0);
+
+	if (npages > INT_MAX)
+	{
+		/* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages too large; "
+						 "buf_size %zu; XLOG_BLCKSZ %d",
+						 *newval, buf_size, (int) XLOG_BLCKSZ);
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+	NvwalSize = (Size) newval * 1024 * 1024;
+}
+
 /*
  * Read the control file, set respective GUCs.
  *
@@ -4934,13 +5011,49 @@ XLOGShmemSize(void)
 {
 	Size		size;
 
+	/*
+	 * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+	 * Instead, we set it the value based on the size of the file for the
+	 * buffer. This should be done here because of xlblocks array calculation.
+	 */
+	if (NvwalAvail)
+	{
+		char		buf[32];
+		int64		npages;
+
+		Assert(NvwalSizeMB > 0);
+		Assert(NvwalSize > 0);
+		Assert(wal_segment_size > 0);
+		Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+		/*
+		 * At last, we can check if the size of non-volatile WAL buffer
+		 * (nvwal_size) is multiple of WAL segment size.
+		 *
+		 * Note that NvwalSize has already been calculated in assign_nvwal_size.
+		 */
+		if (NvwalSize % wal_segment_size != 0)
+		{
+			elog(PANIC,
+				 "invalid value for nvwal_size (%dMB): "
+				 "it should be multiple of WAL segment size; "
+				 "NvwalSize %zu; wal_segment_size %d",
+				 NvwalSizeMB, NvwalSize, wal_segment_size);
+		}
+
+		npages = (int64) NvwalSize / XLOG_BLCKSZ;
+		Assert(npages > 0 && npages <= INT_MAX);
+
+		snprintf(buf, sizeof(buf), "%d", (int) npages);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+	}
 	/*
 	 * If the value of wal_buffers is -1, use the preferred auto-tune value.
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
 	 */
-	if (XLOGbuffers == -1)
+	else if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
@@ -4956,10 +5069,13 @@ XLOGShmemSize(void)
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	if (!NvwalAvail)
+	{
+		/* extra alignment padding for XLOG I/O buffers */
+		size = add_size(size, XLOG_BLCKSZ);
+		/* and the buffers themselves */
+		size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	}
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5056,13 +5172,32 @@ XLOGShmemInit(void)
 	}
 
 	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * Open and memory-map a file for non-volatile XLOG buffer. The PMDK will
+	 * align the start of the buffer to 2-MiB boundary if the size of the
+	 * buffer is larger than or equal to 4 MiB.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	if (NvwalAvail)
+	{
+		/* Logging and error-handling should be done in the function */
+		XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+		/*
+		 * Do not memset non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it would contain records for recovery. We should do so in
+		 * checkpoint after the recovery completes successfully.
+		 */
+	}
+	else
+	{
+		/*
+		 * Align the start of the page buffers to a full xlog block size
+		 * boundary. This simplifies some calculations in XLOG insertion. It
+		 * is also required for O_DIRECT.
+		 */
+		allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+		XLogCtl->pages = allocptr;
+		memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	}
 
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8343,6 +8478,13 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+
+	/*
+	 * If we use non-volatile XLOG buffer, unmap it.
+	 */
+	if (NvwalAvail)
+		UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f0ed326a1b..39d087d2d1 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2606,7 +2606,7 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-		-1, -1, (INT_MAX / XLOG_BLCKSZ),
+		-1, -1, INT_MAX,
 		check_wal_buffers, NULL, NULL
 	},
 
@@ -3194,6 +3194,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, assign_tcp_user_timeout, show_tcp_user_timeout
 	},
 
+	{
+		{"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+			NULL,
+			GUC_UNIT_MB
+		},
+		&NvwalSizeMB,
+		1024, 1, INT_MAX,
+		check_nvwal_size, assign_nvwal_size, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4199,6 +4210,16 @@ static struct config_string ConfigureNamesString[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+			NULL
+		},
+		&NvwalPath,
+		"",
+		check_nvwal_path, assign_nvwal_path, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index b61e66932c..f77a4a7d0e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -223,6 +223,8 @@
 #checkpoint_timeout = 5min		# range 30s-1d
 #max_wal_size = 1GB
 #min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index acf610808e..f08da4da9b 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,7 +144,10 @@ static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
 static int	wal_segment_size_mb;
+static int	nvwal_size_mb;
 
 
 /* internal vars */
@@ -1115,14 +1118,78 @@ setup_config(void)
 	conflines = replace_token(conflines, "#port = 5432", repltok);
 #endif
 
-	/* set default max_wal_size and min_wal_size */
-	snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
-	conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
-
-	snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
-	conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	if (nvwal_path != NULL)
+	{
+		int nr_segs;
+
+		if (str_nvwal_size_mb == NULL)
+			nvwal_size_mb = 1024;
+		else
+		{
+			char *endptr;
+
+			/* check that the argument is a number */
+			nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+			/* verify that the size of non-volatile WAL buffer is valid */
+			if (endptr == str_nvwal_size_mb || *endptr != '\0')
+			{
+				pg_log_error("argument of --nvwal-size must be a number; "
+							 "str_nvwal_size_mb '%s'",
+							 str_nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb <= 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a positive number; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb % wal_segment_size_mb != 0)
+			{
+				pg_log_error("argument of --nvwal-size must be multiple of WAL segment size; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+				exit(1);
+			}
+		}
+
+		/*
+		 * XXX We set {min_,max_,nv}wal_size to the same value.  Note that
+		 * postgres might bootstrap and run even if the three settings do not
+		 * have the same value, but that has not been tested yet.
+		 */
+		nr_segs = nvwal_size_mb / wal_segment_size_mb;
+
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+				 nvwal_path);
+		conflines = replace_token(conflines,
+								  "#nvwal_path = '/path/to/nvwal'", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+	}
+	else
+	{
+		/* set default max_wal_size and min_wal_size */
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	}
 
 	snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
 			 escape_quotes(lc_messages));
@@ -2373,6 +2440,8 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("  -P, --nvwal-path=FILE     path to file for non-volatile WAL buffer (NVWAL)\n"));
+	printf(_("  -Q, --nvwal-size=SIZE     size of NVWAL, in megabytes\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("  -k, --data-checksums      use data page checksums\n"));
@@ -3051,6 +3120,8 @@ main(int argc, char *argv[])
 		{"sync-only", no_argument, NULL, 'S'},
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
+		{"nvwal-path", required_argument, NULL, 'P'},
+		{"nvwal-size", required_argument, NULL, 'Q'},
 		{"data-checksums", no_argument, NULL, 'k'},
 		{"allow-group-access", no_argument, NULL, 'g'},
 		{NULL, 0, NULL, 0}
@@ -3094,7 +3165,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
 	{
 		switch (c)
 		{
@@ -3188,6 +3259,12 @@ main(int argc, char *argv[])
 			case 12:
 				str_wal_segment_size_mb = pg_strdup(optarg);
 				break;
+			case 'P':
+				nvwal_path = pg_strdup(optarg);
+				break;
+			case 'Q':
+				str_nvwal_size_mb = pg_strdup(optarg);
+				break;
 			case 'g':
 				SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
 				break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void	UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#define nv_memset_persist	pmem_memset_persist
+#define nv_memcpy_nodrain	pmem_memcpy_nodrain
+#define nv_flush			pmem_flush
+#define nv_drain			pmem_drain
+#define nv_persist			pmem_persist
+
+#else
+static inline void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+static inline void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	return;
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+	return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+				  size_t len)
+{
+	return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+	return;
+}
+
+static inline void
+nv_drain(void)
+{
+	return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+	return;
+}
+
+#endif							/* USE_NVWAL */
+#endif							/* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index d519252aad..bc09fa104c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -129,6 +129,8 @@ extern int	recoveryTargetAction;
 extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
+extern char *NvwalPath;
+extern int  NvwalSizeMB;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 512213aa32..bd2b434d93 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -356,6 +356,9 @@
 /* Define to 1 if you have the `pam' library (-lpam). */
 #undef HAVE_LIBPAM
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define if you have a function readline library */
 #undef HAVE_LIBREADLINE
 
@@ -932,6 +935,9 @@
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
 /* Define to build with OpenSSL support. (--with-openssl) */
 #undef USE_OPENSSL
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index a93ed77c9c..3bd4bbb872 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -432,6 +432,10 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.20.1
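
For reference, the size checks that patch 0001 spreads across
check_nvwal_size and XLOGShmemSize can be condensed into one standalone
function. This is a sketch, not code from the patch: demo_check_nvwal_size,
the DemoSizeCheck enum, and DEMO_XLOG_BLCKSZ (fixed here at the default
8 KiB) are hypothetical names, and the real code reports failures through
GUC_check_errmsg and elog(PANIC) rather than return codes.

```c
/* Sketch only; all demo_ and DEMO_ names are hypothetical. */
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_XLOG_BLCKSZ 8192	/* assumed default WAL page size */

typedef enum DemoSizeCheck
{
	DEMO_SIZE_OK,
	DEMO_SIZE_NOT_POSITIVE,
	DEMO_SIZE_TOO_MANY_PAGES,
	DEMO_SIZE_NOT_SEGMENT_MULTIPLE
} DemoSizeCheck;

/*
 * Validate an nvwal_size given in megabytes against a WAL segment size
 * given in bytes.  On success, stores the number of WAL pages (the value
 * the patch assigns to wal_buffers) in *npages_out.
 */
static DemoSizeCheck
demo_check_nvwal_size(int size_mb, int wal_segment_size, int *npages_out)
{
	uint64_t	bytes;
	uint64_t	npages;

	assert(npages_out != NULL);
	assert(wal_segment_size > 0);
	assert(wal_segment_size % DEMO_XLOG_BLCKSZ == 0);

	if (size_mb <= 0)
		return DEMO_SIZE_NOT_POSITIVE;

	/* Widen before multiplying so large MB values cannot overflow int. */
	bytes = (uint64_t) size_mb * 1024 * 1024;
	npages = bytes / DEMO_XLOG_BLCKSZ;

	/* wal_buffers is an int, so the page count must fit in one. */
	if (npages > (uint64_t) INT_MAX)
		return DEMO_SIZE_TOO_MANY_PAGES;

	/* The buffer must hold a whole number of WAL segments. */
	if (bytes % (uint64_t) wal_segment_size != 0)
		return DEMO_SIZE_NOT_SEGMENT_MULTIPLE;

	*npages_out = (int) npages;
	return DEMO_SIZE_OK;
}
```

With the default 16 MB segment size, a 1024 MB buffer passes and yields
131072 pages, while a 24 MB buffer is rejected because it is not a whole
number of segments.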

0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 6d75e271b7475cc853b13ef54d13ba1c0b2fab1d Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Fri, 24 Jan 2020 13:16:27 +0900
Subject: [PATCH 2/3] Non-volatile WAL buffer

Now external WAL buffer becomes non-volatile.

Bumps PG_CONTROL_VERSION.
---
 src/backend/access/transam/xlog.c       | 975 +++++++++++++++++++++---
 src/backend/replication/walsender.c     |  50 ++
 src/bin/pg_controldata/pg_controldata.c |   3 +
 src/include/access/xlog.h               |   6 +
 src/include/catalog/pg_control.h        |  17 +-
 5 files changed, 948 insertions(+), 103 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eae0c01e3c..ba89d3c158 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -643,6 +643,13 @@ typedef struct XLogCtlData
 	TimeLineID	ThisTimeLineID;
 	TimeLineID	PrevTimeLineID;
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * All the records up to this LSN are persistent in NVWAL.
+	 */
+	XLogRecPtr	persistentUpTo;
+
 	/*
 	 * SharedRecoveryInProgress indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
@@ -766,11 +773,12 @@ typedef enum
 	XLOG_FROM_ANY = 0,			/* request to read WAL from any source */
 	XLOG_FROM_ARCHIVE,			/* restored using restore_command */
 	XLOG_FROM_PG_WAL,			/* existing file in pg_wal */
+	XLOG_FROM_NVWAL,			/* non-volatile WAL buffer */
 	XLOG_FROM_STREAM			/* streamed from master */
 } XLogSource;
 
 /* human-readable names for XLogSources, for debugging output */
-static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
@@ -898,6 +906,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
@@ -1177,6 +1186,43 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/*
+	 * Request a checkpoint here if non-volatile WAL buffer is used and we
+	 * have consumed too much WAL since the last checkpoint.
+	 *
+	 * We first screen under the condition (1) OR (2) below:
+	 *
+	 * (1) The record was the first one in a certain segment.
+	 * (2) The record was inserted across segments.
+	 *
+	 * We then check the segment number which the record was inserted into.
+	 */
+	if (NvwalAvail && inserted &&
+		(StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+		 StartPos / wal_segment_size < EndPos / wal_segment_size))
+	{
+		XLogSegNo	end_segno;
+
+		XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+		/*
+		 * NOTE: We do not signal walsender here because the inserted record
+		 * has not been drained to the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal walarchiver here because the inserted record
+		 * has not been flushed to a segment file.  So we don't need to update
+		 * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+		 */
+
+		/* Two-step checking for speed (see also XLogWrite) */
+		if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+		{
+			(void) GetRedoRecPtr();
+			if (XLogCheckpointNeeded(end_segno))
+				RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+		}
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -2100,6 +2146,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
 	int			npages = 0;
+	bool		is_firstpage;
+
+	if (NvwalAvail)
+		elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo,
+			 (uint32) (upto >> 32),
+			 (uint32) upto,
+			 opportunistic ? "true" : "false");
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -2161,7 +2216,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				{
 					/* Have to write it ourselves */
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
+
+					if (NvwalAvail)
+					{
+						/*
+						 * If we use non-volatile WAL buffer, it is a special
+						 * but expected case to write the buffer pages out to
+						 * segment files, and for simplicity, it is done
+						 * segment by segment.
+						 */
+						XLogRecPtr		OldSegEndPtr;
+
+						OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+						Assert(OldSegEndPtr % wal_segment_size == 0);
+
+						WriteRqst.Write = OldSegEndPtr;
+					}
+					else
+						WriteRqst.Write = OldPageRqstPtr;
+
 					WriteRqst.Flush = 0;
 					XLogWrite(WriteRqst, false);
 					LWLockRelease(WALWriteLock);
@@ -2188,7 +2261,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
 		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+		if (NvwalAvail)
+		{
+			/*
+			 * We do not combine MemSet() with pmem_persist() because
+			 * pmem_persist() may fall back to a slow, strongly ordered cache
+			 * flush instruction if a fast, weakly ordered one is not supported.
+			 * Instead, we first zero-fill the buffer with
+			 * pmem_memset_persist(), which can use fast non-temporal store
+			 * instructions, then make the header persistent later.
+			 */
+			nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+		}
+		else
+			MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
 
 		/*
 		 * Fill the new page's header
@@ -2220,7 +2306,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+		is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+		if (is_firstpage)
 		{
 			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
 
@@ -2235,7 +2322,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
 		 * holding a lock.
 		 */
-		pg_write_barrier();
+		if (NvwalAvail)
+		{
+			/* Make the header persistent on PMEM */
+			nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+		}
+		else
+			pg_write_barrier();
 
 		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
 
@@ -2245,6 +2338,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	}
 	LWLockRelease(WALBufMappingLock);
 
+	if (NvwalAvail)
+		elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+			 (uint32) (ControlFile->discardedUpTo >> 32),
+			 (uint32) ControlFile->discardedUpTo,
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo);
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
 	{
@@ -2616,6 +2716,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
 
+	/*
+	 * Update discardedUpTo if NVWAL is used.  A new value should not fall
+	 * behind the old one.
+	 */
+	if (NvwalAvail)
+	{
+		Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		if (ControlFile->discardedUpTo < LogwrtResult.Write)
+		{
+			ControlFile->discardedUpTo = LogwrtResult.Write;
+			UpdateControlFile();
+		}
+		LWLockRelease(ControlFileLock);
+	}
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -2820,6 +2937,123 @@ XLogFlush(XLogRecPtr record)
 		return;
 	}
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	FromPos;
+
+		/*
+		 * No page on the NVWAL is to be flushed to segment files.  Instead,
+		 * we wait for all the insertions preceding this one to complete.  We
+		 * will wait for all the records to be persistent on the NVWAL below.
+		 */
+		record = WaitXLogInsertionsToFinish(record);
+
+		/*
+		 * Check if another backend has already done what we are about to do.
+		 *
+		 * We can compare something <= XLogCtl->persistentUpTo without
+		 * holding XLogCtl->info_lck spinlock because persistentUpTo is
+		 * monotonically increasing and can be loaded atomically on each
+		 * NVWAL-supported platform (now x64 only).
+		 */
+		FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+		if (record <= FromPos)
+			return;
+
+		/*
+		 * In a very rare case, we have wrapped around the whole NVWAL.  We do
+		 * not need to care about old pages here because they have already
+		 * been evicted to segment files at record insertion.
+		 *
+		 * In such a case, we flush the whole NVWAL.  We also log a warning
+		 * because it can be a time-consuming operation.
+		 *
+		 * TODO Advance XLogCtl->persistentUpTo at the end of XLogWrite, and
+		 * we can remove the following first if-block.
+		 */
+		if (record - FromPos > NvwalSize)
+		{
+			elog(WARNING, "flush whole the NVWAL; FromPos %X/%X; record %X/%X",
+				 (uint32) (FromPos >> 32), (uint32) FromPos,
+				 (uint32) (record >> 32), (uint32) record);
+
+			nv_flush(XLogCtl->pages, NvwalSize);
+		}
+		else
+		{
+			char   *frompos;
+			char   *uptopos;
+			size_t	fromoff;
+			size_t	uptooff;
+
+			/*
+			 * Flush each record that is probably not flushed yet.
+			 *
+			 * We say "probably" for two reasons.  First, a record copied
+			 * with non-temporal store instructions has already been
+			 * "flushed", but we cannot distinguish it.  nv_flush on such a
+			 * record does not harm consistency.
+			 *
+			 * Second, the target record might have already been evicted to
+			 * a segment file by now.  In this case too, nv_flush does not
+			 * harm consistency.
+			 */
+			uptooff = record % NvwalSize;
+			uptopos = XLogCtl->pages + uptooff;
+			fromoff = FromPos % NvwalSize;
+			frompos = XLogCtl->pages + fromoff;
+
+			/* Handles rotation */
+			if (uptopos <= frompos)
+			{
+				nv_flush(frompos, NvwalSize - fromoff);
+				fromoff = 0;
+				frompos = XLogCtl->pages;
+			}
+
+			nv_flush(frompos, uptooff - fromoff);
+		}
+
+		/*
+		 * To guarantee durability ("D" of ACID), we should satisfy the
+		 * following two conditions for each transaction X:
+		 *
+		 *  (1) All the WAL records inserted by X, including the commit record
+		 *      of X, should persist on NVWAL before the server commits X.
+		 *
+		 *  (2) All the WAL records inserted by any transaction other than
+		 *      X that have a lower LSN than the commit record just inserted
+		 *      by X should persist on NVWAL before the server commits X.
+		 *
+		 * (1) can be satisfied by a store barrier after the commit record
+		 * of X is flushed because each WAL record of X is already flushed at
+		 * the end of its insertion.  (2) can be satisfied by waiting for
+		 * any record insertions that have a lower LSN than the commit record
+		 * just inserted by X, and by a store barrier as well.
+		 *
+		 * Now is the time.  Have a store barrier.
+		 */
+		nv_drain();
+
+		/*
+		 * Remember where the last persistent record is.  A new value should
+		 * not fall behind the old one.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->persistentUpTo < record)
+			XLogCtl->persistentUpTo = record;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * The records up to the returned "record" are now persistent on
+		 * NVWAL.  Now signal walsenders.
+		 */
+		WalSndWakeupRequest();
+		WalSndWakeupProcessRequests();
+
+		return;
+	}
+
 	/* Quick exit if already known flushed */
 	if (record <= LogwrtResult.Flush)
 		return;
@@ -3003,6 +3237,13 @@ XLogBackgroundFlush(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/*
+	 * Quick exit if NVWAL buffer is used and archiving is not active. In this
+	 * case, we need no WAL segment file in pg_wal directory.
+	 */
+	if (NvwalAvail && !XLogArchivingActive())
+		return false;
+
 	/* read LogwrtResult and update local state */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
@@ -3021,6 +3262,18 @@ XLogBackgroundFlush(void)
 		flexible = false;		/* ensure it all gets written */
 	}
 
+	/*
+	 * If NVWAL is used, back off to the last completed segment boundary
+	 * so that buffer pages are written to files segment by segment.  We do
+	 * so here, after XLogCtl->asyncXactLSN is loaded, because that value
+	 * should be taken into account.
+	 */
+	if (NvwalAvail)
+	{
+		WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+		flexible = false;		/* ensure it all gets written */
+	}
+
 	/*
 	 * If already known flushed, we're done. Just need to check if we are
 	 * holding an open file handle to a logfile that's no longer in use,
@@ -3047,7 +3300,12 @@ XLogBackgroundFlush(void)
 	flushbytes =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
-	if (WalWriterFlushAfter == 0 || lastflush == 0)
+	if (NvwalAvail)
+	{
+		WriteRqst.Flush = WriteRqst.Write;
+		lastflush = now;
+	}
+	else if (WalWriterFlushAfter == 0 || lastflush == 0)
 	{
 		/* first call, or block based limits disabled */
 		WriteRqst.Flush = WriteRqst.Write;
@@ -3106,7 +3364,28 @@ XLogBackgroundFlush(void)
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
 	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+	if (NvwalAvail && max_wal_senders == 0)
+	{
+		XLogRecPtr		upto;
+
+		/*
+		 * If NVWAL is used and there is no walsender, nobody will load
+		 * segments from the buffer.  So let's recycle segments up to {where we
+		 * have requested to write and flush} + NvwalSize.
+		 *
+		 * Note that if NVWAL is used and a walsender may be running, we do
+		 * nothing; keep the written pages on the buffer so that walsenders can
+		 * load them from the buffer, not from the segment files.  Note that the
+		 * buffer pages are eventually recycled by checkpoint.
+		 */
+		Assert(WriteRqst.Write == WriteRqst.Flush);
+		Assert(WriteRqst.Write % wal_segment_size == 0);
+
+		upto = WriteRqst.Write + NvwalSize;
+		AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+	}
+	else
+		AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
 
 	/*
 	 * If we determined that we need to write data, but somebody else
@@ -3806,6 +4085,43 @@ XLogFileClose(void)
 	openLogFile = -1;
 }
 
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+	XLogRecPtr	newupto,
+				InitializedUpTo;
+
+	Assert(NvwalAvail);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	newupto = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	InitializedUpTo = XLogCtl->InitializedUpTo;
+
+	newupto += NvwalSize;
+	Assert(newupto % wal_segment_size == 0);
+
+	if (newupto <= InitializedUpTo)
+		return;
+
+	/*
+	 * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+	 * handles the first argument as the beginning of pages, not the end.
+	 */
+	AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
 /*
  * Preallocate log files beyond the specified log endpoint.
  *
@@ -4101,8 +4417,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
+	 *
+	 * If NVWAL buffer is used, a log segment file is never recycled
+	 * (that is, we always go into the else block).
 	 */
-	if (wal_recycle &&
+	if (!NvwalAvail && wal_recycle &&
 		endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
@@ -5336,36 +5655,53 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
-
-	/* Write the first page with the initial record */
-	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	if (NvwalAvail)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+		pgstat_report_wait_end();
+
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		nv_drain();
+		pgstat_report_wait_end();
+
+		/*
+		 * Other WAL state will be initialized in the startup process.
+		 */
 	}
-	pgstat_report_wait_end();
+	else
+	{
+		use_existent = false;
+		openLogFile = XLogFileInit(1, &use_existent, false);
+
+		/* Write the first page with the initial record */
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
+		pgstat_report_wait_end();
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
-	pgstat_report_wait_end();
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		if (pg_fsync(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_end();
 
-	if (close(openLogFile))
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not close bootstrap write-ahead log file: %m")));
+		if (close(openLogFile))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close bootstrap write-ahead log file: %m")));
 
-	openLogFile = -1;
+		openLogFile = -1;
+	}
 
 	/* Now create pg_control */
 
@@ -5378,6 +5714,7 @@ BootStrapXLOG(void)
 	ControlFile->checkPoint = checkPoint.redo;
 	ControlFile->checkPointCopy = checkPoint;
 	ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+	ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5638,35 +5975,41 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * happens in the middle of a segment, copy data from the last WAL segment
 	 * of the old timeline up to the switch point, to the starting WAL segment
 	 * on the new timeline.
+	 *
+	 * If non-volatile WAL buffer is used, no new segment file is created. Data
+	 * up to the switch point will be copied into NVWAL buffer by StartupXLOG().
 	 */
-	if (endLogSegNo == startLogSegNo)
+	if (!NvwalAvail)
 	{
-		/*
-		 * Make a copy of the file on the new timeline.
-		 *
-		 * Writing WAL isn't allowed yet, so there are no locking
-		 * considerations. But we should be just as tense as XLogFileInit to
-		 * avoid emplacing a bogus file.
-		 */
-		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
-					 XLogSegmentOffset(endOfLog, wal_segment_size));
-	}
-	else
-	{
-		/*
-		 * The switch happened at a segment boundary, so just create the next
-		 * segment on the new timeline.
-		 */
-		bool		use_existent = true;
-		int			fd;
+		if (endLogSegNo == startLogSegNo)
+		{
+			/*
+			 * Make a copy of the file on the new timeline.
+			 *
+			 * Writing WAL isn't allowed yet, so there are no locking
+			 * considerations. But we should be just as tense as XLogFileInit to
+			 * avoid emplacing a bogus file.
+			 */
+			XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+						 XLogSegmentOffset(endOfLog, wal_segment_size));
+		}
+		else
+		{
+			/*
+			 * The switch happened at a segment boundary, so just create the next
+			 * segment on the new timeline.
+			 */
+			bool		use_existent = true;
+			int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+			fd = XLogFileInit(startLogSegNo, &use_existent, true);
 
-		if (close(fd))
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not close file \"%s\": %m",
-							XLogFileNameP(ThisTimeLineID, startLogSegNo))));
+			if (close(fd))
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m",
+									XLogFileNameP(ThisTimeLineID, startLogSegNo))));
+		}
 	}
 
 	/*
@@ -6888,6 +7231,11 @@ StartupXLOG(void)
 		InRecovery = true;
 	}
 
+	/* Dump discardedUpTo just before REDO */
+	elog(LOG, "ControlFile->discardedUpTo %X/%X",
+		 (uint32) (ControlFile->discardedUpTo >> 32),
+		 (uint32) ControlFile->discardedUpTo);
+
 	/* REDO */
 	if (InRecovery)
 	{
@@ -7635,10 +7983,88 @@ StartupXLOG(void)
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	discardedUpTo;
+
+		discardedUpTo = ControlFile->discardedUpTo;
+		Assert(discardedUpTo == InvalidXLogRecPtr ||
+			   discardedUpTo % wal_segment_size == 0);
+
+		if (discardedUpTo == InvalidXLogRecPtr)
+		{
+			elog(DEBUG1, "brand-new NVWAL");
+
+			/* The "Tricky point" code below initializes the buffer */
+		}
+		else if (EndOfLog <= discardedUpTo)
+		{
+			elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = InvalidXLogRecPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+
+			nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+			/* The "Tricky point" code below initializes the buffer */
+		}
+		else
+		{
+			int			last_idx;
+			int			idx;
+			XLogRecPtr	ptr;
+
+			elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+			/*
+			 * Initialize the xlblocks array because we decided to keep UNDONE
+			 * records on the NVWAL buffer; otherwise each page on the buffer
+			 * with xlblocks == 0 (initialized so by XLOGShmemInit) would be
+			 * accidentally cleared by the following AdvanceXLInsertBuffer!
+			 *
+			 * Two cases can be considered:
+			 *
+			 * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+			 *    Initialize up to (and including) the page containing the last
+			 *    record.  That page should end at EndOfLog.  The next page
+			 *    "N", beginning at EndOfLog, must be left untouched because,
+			 *    in the corner case that all the NVWAL buffer pages are
+			 *    already filled, page N is at the same location as the
+			 *    first page "F" beginning at discardedUpTo.
+			 *    Of course we should not overwrite page F.
+			 *
+			 *    In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+			 *    last_idx, indicating page N.  Then, we go forward from
+			 *    page F up to (but excluding) page N, which in that corner
+			 *    case has the same index as page F.
+			 *
+			 * 2) EndOfLog is not on a page boundary:  Initialize all the pages
+			 *    but the page "L" having the last record. The page L is to be
+			 *    initialized by the following "Tricky point", including its
+			 *    content.
+			 *
+			 * In either case, XLogCtl->InitializedUpTo is to be initialized in
+			 * the following "Tricky" if-else block.
+			 */
+
+			last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+			ptr = discardedUpTo;
+			for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+				 idx = NextBufIdx(idx))
+			{
+				ptr += XLOG_BLCKSZ;
+				XLogCtl->xlblocks[idx] = ptr;
+			}
+		}
+	}
+
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * Tricky point here: readBuf contains the *last* block that the
+	 * LastRec record spans, not the one it starts in.  The last block is
+	 * indeed the one we want to use.
 	 */
 	if (EndOfLog % XLOG_BLCKSZ != 0)
 	{
@@ -7658,6 +8084,9 @@ StartupXLOG(void)
 		memcpy(page, xlogreader->readBuf, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
+		if (NvwalAvail)
+			nv_persist(page, XLOG_BLCKSZ);
+
 		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
 		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
 	}
@@ -7671,12 +8100,54 @@ StartupXLOG(void)
 		XLogCtl->InitializedUpTo = EndOfLog;
 	}
 
-	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+	if (NvwalAvail)
+	{
+		XLogRecPtr	SegBeginPtr;
 
-	XLogCtl->LogwrtResult = LogwrtResult;
+		/*
+		 * If NVWAL buffer is used, records should be written out to segment
+		 * files segment by segment.  So Logwrt{Rqst,Result} (and also
+		 * discardedUpTo) should be multiples of wal_segment_size.  Let's back
+		 * them off to the last segment boundary.
+		 */
 
-	XLogCtl->LogwrtRqst.Write = EndOfLog;
-	XLogCtl->LogwrtRqst.Flush = EndOfLog;
+		SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+		LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+		XLogCtl->LogwrtResult = LogwrtResult;
+		XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+		XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+		/*
+		 * persistentUpTo does not need to be a multiple of wal_segment_size;
+		 * it should be the LSN up to which records have been drained.
+		 * walsender will use it to load records from NVWAL buffer.
+		 */
+		XLogCtl->persistentUpTo = EndOfLog;
+
+		/* Update discardedUpTo in pg_control if still invalid */
+		if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+		{
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = SegBeginPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+		}
+
+		elog(DEBUG1, "EndOfLog: %X/%X",
+			 (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
+
+		elog(DEBUG1, "SegBeginPtr: %X/%X",
+			 (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+	}
+	else
+	{
+		LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		XLogCtl->LogwrtRqst.Write = EndOfLog;
+		XLogCtl->LogwrtRqst.Flush = EndOfLog;
+	}
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7807,6 +8278,7 @@ StartupXLOG(void)
 				char		origpath[MAXPGPATH];
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
+				XLogRecPtr	discardedUpTo;
 
 				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7818,6 +8290,53 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
+				/*
+				 * If NVWAL is also used for archival recovery, write old
+				 * records out to segment files to archive them.  Note that we
+				 * need locks related to WAL because LocalXLogInsertAllowed
+				 * has already been set to -1.
+				 */
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo < EndOfLog)
+				{
+					XLogwrtRqst WriteRqst;
+					TimeLineID	thisTLI = ThisTimeLineID;
+					XLogRecPtr	SegBeginPtr =
+						EndOfLog - (EndOfLog % wal_segment_size);
+
+					/*
+					 * XXX Assume that all the records have the same TLI.
+					 */
+					ThisTimeLineID = EndOfLogTLI;
+
+					WriteRqst.Write = EndOfLog;
+					WriteRqst.Flush = 0;
+
+					LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+					XLogWrite(WriteRqst, false);
+
+					/*
+					 * Force back-off to the last segment boundary.
+					 */
+					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+					ControlFile->discardedUpTo = SegBeginPtr;
+					UpdateControlFile();
+					LWLockRelease(ControlFileLock);
+
+					LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+					SpinLockAcquire(&XLogCtl->info_lck);
+					XLogCtl->LogwrtResult = LogwrtResult;
+					XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+					XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+					SpinLockRelease(&XLogCtl->info_lck);
+
+					LWLockRelease(WALWriteLock);
+
+					ThisTimeLineID = thisTLI;
+				}
+
 				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
@@ -7827,7 +8346,10 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	if (NvwalAvail)
+		PreallocNonVolatileXlogBuffer();
+	else
+		PreallocXlogFiles(EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8371,10 +8893,24 @@ GetInsertRecPtr(void)
 /*
  * GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
  * position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
  */
 XLogRecPtr
 GetFlushRecPtr(void)
 {
+	if (NvwalAvail)
+	{
+		XLogRecPtr		ret;
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		ret = XLogCtl->persistentUpTo;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		return ret;
+	}
+
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	SpinLockRelease(&XLogCtl->info_lck);
@@ -8674,6 +9210,9 @@ CreateCheckPoint(int flags)
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
+	/* for non-volatile WAL buffer */
+	XLogRecPtr	newDiscardedUpTo = 0;
+
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 	 * issued at a different time.
@@ -8985,6 +9524,22 @@ CreateCheckPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	/*
+	 * If non-volatile WAL buffer is used, discardedUpTo should be updated and
+	 * persisted in the control file.  So the new value should be calculated
+	 * here.
+	 *
+	 * TODO Do not copy and paste code...
+	 */
+	if (NvwalAvail)
+	{
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+
+		newDiscardedUpTo = _logSegNo * wal_segment_size;
+	}
+
 	/*
 	 * Update the control file.
 	 */
@@ -8993,6 +9548,16 @@ CreateCheckPoint(int flags)
 		ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
+	if (NvwalAvail)
+	{
+		/*
+		 * A new value should not fall behind the old one.
+		 */
+		if (ControlFile->discardedUpTo < newDiscardedUpTo)
+			ControlFile->discardedUpTo = newDiscardedUpTo;
+		else
+			newDiscardedUpTo = ControlFile->discardedUpTo;
+	}
 	ControlFile->time = (pg_time_t) time(NULL);
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9010,6 +9575,44 @@ CreateCheckPoint(int flags)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+	 * so that the XLOG records older than newDiscardedUpTo are treated as
+	 * "already written and flushed."
+	 */
+	if (NvwalAvail)
+	{
+		Assert(newDiscardedUpTo > 0);
+
+		/* Update process-local variables */
+		LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+		/*
+		 * Update shared-memory variables. We need both light-weight lock and
+		 * spin lock to update them.
+		 */
+		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&XLogCtl->info_lck);
+
+		/*
+		 * Note that there can be a corner case in which the process-local
+		 * LogwrtResult falls behind shared XLogCtl->LogwrtResult if the whole
+		 * non-volatile XLOG buffer is filled and some pages are written out
+		 * to segment files between UpdateControlFile and LWLockAcquire above.
+		 *
+		 * TODO For now, we ignore that case because it can hardly occur.
+		 */
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+		if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteLock);
+	}
+
 	/* Update shared-memory copy of checkpoint XID/epoch */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
@@ -9033,21 +9636,31 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
-	/*
-	 * Delete old log files, those no longer needed for last checkpoint to
-	 * prevent the disk holding the xlog from growing full.
-	 */
-	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-	KeepLogSeg(recptr, &_logSegNo);
-	_logSegNo--;
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	if (NvwalAvail)
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	else
+	{
+		/*
+		 * Delete old log files, those no longer needed for last checkpoint to
+		 * prevent the disk holding the xlog from growing full.
+		 */
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+	{
+		if (NvwalAvail)
+			PreallocNonVolatileXlogBuffer();
+		else
+			PreallocXlogFiles(recptr);
+	}
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -11651,6 +12264,76 @@ CancelBackup(void)
 	}
 }
 
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+	return NvwalAvail;
+}
+
+/*
+ * Get a pointer to the *possibly* right location in the NVWAL buffer
+ * containing the target XLogRecPtr; NULL if the target has already been
+ * discarded.
+ *
+ * Note that the target may be discarded by a checkpoint after this
+ * function returns.  The caller should check that the copied record has
+ * the expected LSN.
+ */
+char *
+GetNvwalBuffer(XLogRecPtr target, Size *max_read)
+{
+	Size		off;
+	XLogRecPtr	discardedUpTo;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	discardedUpTo = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	if (target < discardedUpTo)
+		return NULL;
+
+	off = target % NvwalSize;
+	*max_read = NvwalSize - off;
+	return XLogCtl->pages + off;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets nvwalptr to load-from LSN.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+	XLogRecPtr	readUpTo;
+	XLogRecPtr	discardedUpTo;
+
+	Assert(IsNvwalAvail());
+	Assert(nvwalptr != NULL);
+
+	readUpTo = target + count;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	discardedUpTo = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	/* Check if all the records are on WAL segment files */
+	if (readUpTo <= discardedUpTo)
+		return 0;
+
+	/* Check if all the records are on NVWAL */
+	if (discardedUpTo <= target)
+	{
+		*nvwalptr = target;
+		return count;
+	}
+
+	/* Some on WAL segment files, some on NVWAL */
+	*nvwalptr = discardedUpTo;
+	return (Size) (readUpTo - discardedUpTo);
+}
+
 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
  * Returns number of bytes read, if the page is read successfully, or -1
@@ -11718,7 +12401,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readSource != XLOG_FROM_NVWAL && readFile < 0) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
@@ -11730,10 +12413,68 @@ retry:
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
-			readLen = 0;
-			readSource = 0;
 
-			return -1;
+			/*
+			 * Try non-volatile WAL buffer as last resort.
+			 *
+			 * XXX It is not supported yet in standby mode.
+			 */
+			if (NvwalAvail && !StandbyMode && readSource != XLOG_FROM_STREAM)
+			{
+				XLogRecPtr	discardedUpTo;
+
+				elog(DEBUG1, "see if NVWAL has records to be UNDONE");
+
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo <= targetPagePtr)
+				{
+					elog(DEBUG1, "recovering NVWAL");
+
+					/* Loading records from non-volatile WAL buffer */
+					currentSource = XLOG_FROM_NVWAL;
+					lastSourceFailed = false;
+
+					/* Report recovery progress in PS display */
+					set_ps_display("recovering NVWAL", false);
+
+					/* Track source of data */
+					readSource = XLOG_FROM_NVWAL;
+					XLogReceiptSource = XLOG_FROM_NVWAL;
+
+					/* Track receipt time */
+					XLogReceiptTime = GetCurrentTimestamp();
+
+					/*
+					 * Construct expectedTLEs.  This is necessary to recover
+					 * only from NVWAL because its filename does not have any
+					 * TLI information.
+					 */
+					if (!expectedTLEs)
+					{
+						TimeLineHistoryEntry *entry;
+
+						entry = (TimeLineHistoryEntry *) palloc(sizeof(TimeLineHistoryEntry));
+						entry->tli = recoveryTargetTLI;
+						entry->begin = entry->end = InvalidXLogRecPtr;
+
+						expectedTLEs = list_make1(entry);
+
+						elog(DEBUG1, "expectedTLEs: [%u]", (uint32) recoveryTargetTLI);
+					}
+				}
+			}
+			else
+				elog(DEBUG1, "do not recover NVWAL");
+
+			/* See if the try above succeeded or not */
+			if (readSource != XLOG_FROM_NVWAL)
+			{
+				readLen = 0;
+				readSource = 0;
+
+				return -1;
+			}
 		}
 	}
 
@@ -11741,7 +12482,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || readSource == XLOG_FROM_NVWAL);
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -11760,41 +12501,60 @@ retry:
 	else
 		readLen = XLOG_BLCKSZ;
 
-	/* Read the requested page */
 	readOff = targetPageOff;
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (currentSource == XLOG_FROM_NVWAL)
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		Size		offset = (Size) (targetPagePtr % NvwalSize);
+		char	   *readpos = XLogCtl->pages + offset;
 
+		Assert(readLen == XLOG_BLCKSZ);
+		Assert(offset % XLOG_BLCKSZ == 0);
+
+		/* Load the requested page from non-volatile WAL buffer */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		memcpy(readBuf, readpos, readLen);
 		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+
+		/* There are not any other clues of TLI... */
+		*readTLI = ((XLogPageHeader) readBuf)->xlp_tli;
+	}
+	else
+	{
+		/* Read the requested page from file */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
+		pgstat_report_wait_end();
+
+		*readTLI = curFileTLI;
 	}
-	pgstat_report_wait_end();
 
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	*readTLI = curFileTLI;
-
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
 	 * it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -11828,6 +12588,17 @@ retry:
 		goto next_record_is_invalid;
 	}
 
+	/*
+	 * Updating curFileTLI on each page verified if non-volatile WAL buffer
+	 * is used because there is no TimeLineID information in NVWAL's filename.
+	 */
+	if (readSource == XLOG_FROM_NVWAL &&
+		curFileTLI != xlogreader->latestPageTLI)
+	{
+		curFileTLI = xlogreader->latestPageTLI;
+		elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+	}
+
 	return readLen;
 
 next_record_is_invalid:
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 92fa86fc9d..f2992b4a85 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2368,7 +2368,9 @@ XLogRead(char *buf, XLogRecPtr startptr, Size count)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
+	XLogRecPtr	recptr_nvwal = 0;
 	Size		nbytes;
+	Size		nbytes_nvwal = 0;
 	XLogSegNo	segno;
 
 retry:
@@ -2376,6 +2378,13 @@ retry:
 	recptr = startptr;
 	nbytes = count;
 
+	/* Try to load records directly from NVWAL if used */
+	if (IsNvwalAvail())
+	{
+		nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+		nbytes = count - nbytes_nvwal;
+	}
+
 	while (nbytes > 0)
 	{
 		uint32		startoff;
@@ -2500,6 +2509,47 @@ retry:
 		p += readbytes;
 	}
 
+	/* Load records directly from NVWAL */
+	while (nbytes_nvwal > 0)
+	{
+		char	   *src;
+		Size		max_read = 0;
+		Size		readbytes;
+
+		Assert(IsNvwalAvail());
+
+		/*
+		 * Get the target address on NVWAL and the size we can load from it at
+		 * once, because the WAL buffer can wrap around and we might have to
+		 * load what we want divided into two or more pieces.
+		 *
+		 * Note that, in a rare case, some records on NVWAL might have been
+		 * already discarded.  We retry in such a case.
+		 */
+		src = GetNvwalBuffer(recptr_nvwal, &max_read);
+		if (src == NULL)
+		{
+			elog(WARNING, "some records on NVWAL had been discarded; retry");
+			goto retry;
+		}
+
+		if (nbytes_nvwal < max_read)
+			readbytes = nbytes_nvwal;
+		else
+			readbytes = max_read;
+
+		memcpy(p, src, readbytes);
+
+		/*
+		 * Update state for load.  Note that we do not need to update sendOff
+		 * because it indicates an offset in a segment file and we do not use
+		 * any segment file inside this loop.
+		 */
+		recptr_nvwal += readbytes;
+		nbytes_nvwal -= readbytes;
+		p += readbytes;
+	}
+
 	/*
 	 * After reading into the buffer, check that what we read was valid. We do
 	 * this after reading, because even though the segment was present when we
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index d955b97c0b..a47caefa99 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -280,6 +280,9 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("discarded Up To:                      %X/%X\n"),
+		   (uint32) (ControlFile->discardedUpTo >> 32),
+		   (uint32) ControlFile->discardedUpTo);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index bc09fa104c..14efb904be 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -325,6 +325,12 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern bool IsNvwalAvail(void);
+extern char *GetNvwalBuffer(XLogRecPtr target, Size *max_read);
+extern Size GetLoadableSizeFromNvwal(XLogRecPtr target,
+									 Size count,
+									 XLogRecPtr *nvwalptr);
+
 /*
  * Routines to start, stop, and get status of a base backup.
  */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index ff98d9e91a..04b1b94645 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1201
+#define PG_CONTROL_VERSION	9901
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
 
 	XLogRecPtr	unloggedLSN;	/* current fake LSN value, for unlogged rels */
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+	 * checkpoint or a restartpoint is completed successfully, or the whole
+	 * NVWAL is filled with WAL records and a new record is being inserted.
+	 * This field tells that the NVWAL contains WAL records in the range of
+	 * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+	 * Note that the WAL records whose LSN are less than discardedUpTo would
+	 * remain in WAL segment files and be needed for recovery.
+	 *
+	 * It is set to zero when NVWAL is not used.
+	 */
+	XLogRecPtr	discardedUpTo;
+
 	/*
 	 * These two values determine the minimum point we must recover up to
 	 * before starting up:
-- 
2.20.1

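The discardedUpTo bookkeeping in the patch above can be illustrated outside PostgreSQL. The following is a minimal sketch, not the patch's code: it mirrors the range split done by GetLoadableSizeFromNvwal() and the offset arithmetic of GetNvwalBuffer(), but takes discardedUpTo and the NVWAL size as plain arguments (no ControlFileLock, no shared memory). The names loadable_from_nvwal and nvwal_offset and the XLogRecPtr typedef are stand-ins introduced for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's type */

/*
 * Split the read request [target, target + count) at discarded_up_to.
 * Returns how many bytes are still held in NVWAL and stores in *nvwalptr
 * the LSN at which that part starts.  The remaining leading bytes, if any,
 * must be read from WAL segment files.
 */
static size_t
loadable_from_nvwal(XLogRecPtr target, size_t count,
					XLogRecPtr discarded_up_to, XLogRecPtr *nvwalptr)
{
	XLogRecPtr	read_up_to = target + count;

	assert(nvwalptr != NULL);

	/* Everything requested has already been moved to segment files */
	if (read_up_to <= discarded_up_to)
		return 0;

	/* Everything requested is still in NVWAL */
	if (discarded_up_to <= target)
	{
		*nvwalptr = target;
		return count;
	}

	/* Head is in segment files, tail is in NVWAL */
	*nvwalptr = discarded_up_to;
	return (size_t) (read_up_to - discarded_up_to);
}

/*
 * Map an LSN to its offset in a ring buffer of nvwal_size bytes and report
 * how many bytes can be copied contiguously before the buffer wraps.
 */
static size_t
nvwal_offset(XLogRecPtr lsn, size_t nvwal_size, size_t *max_read)
{
	size_t		off;

	assert(nvwal_size > 0 && max_read != NULL);

	off = (size_t) (lsn % nvwal_size);
	*max_read = nvwal_size - off;
	return off;
}
```

A caller would first split the request with loadable_from_nvwal(), read the head from segment files, and then loop over nvwal_offset() for the tail, copying at most max_read bytes per iteration, which is the shape of the loop added to XLogRead() in walsender.c.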
0003-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From e98b3c3fd4c48b21b4fe26d568899722cd202dc9 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Fri, 24 Jan 2020 13:16:28 +0900
Subject: [PATCH 3/3] README for non-volatile WAL buffer

---
 README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 README.nvwal

diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. Putting the WAL buffer pages on persistent memory (PMEM) [1],
+inserting WAL records into it directly, and eliminating I/O for WAL segment
+files, PostgreSQL gets lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+  * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+    * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+  * Linux: 4.15 or later (tested on 5.2)
+  * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+  * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I believe it is good to install under your home directory with --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+  $ ./configure --with-nvwal --prefix="$HOME/postgres"
+  $ make
+  $ make install
+  $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make an ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget the "-o dax"
+option on mount. For Intel DCPMM and ipmctl, please see [4].
+
+  $ ndctl list
+  [
+    {
+      "dev":"namespace1.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem1",
+      "numa_node":1
+    },
+    {
+      "dev":"namespace0.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem0",
+      "numa_node":0
+    }
+  ]
+
+  $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+  {
+    "dev":"namespace0.0",
+    "mode":"fsdax",
+    "map":"dev",
+    "size":"94.50 GiB (101.47 GB)",
+    "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+    "sector_size":512,
+    "blockdev":"pmem0",
+    "numa_node":0
+  }
+
+  $ ls -l /dev/pmem0
+  brw-rw---- 1 root disk 259, 3 Jan  6 17:06 /dev/pmem0
+
+  $ sudo mkfs.ext4 -q -F /dev/pmem0
+  $ sudo mkdir -p /mnt/pmem0
+  $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+  $ mount -l | grep ^/dev/pmem0
+  /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge pages
+-----------------------------
+Transparent huge pages are often considered unsuitable for database workloads,
+but they improve the performance of PMEM by reducing page-walk overhead.
+
+  $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+  -rw-r--r-- 1 root root 4096 Dec  3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+  $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+  $ cat /sys/kernel/mm/transparent_hugepage/enabled
+  [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+  -P, --nvwal-path=FILE  path to file for non-volatile WAL buffer (NVWAL)
+  -Q, --nvwal-size=SIZE  size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+  $ sudo mkdir -p /mnt/pmem0/pgsql
+  $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+  $ export PGDATA="$HOME/pgdata"
+  $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is okay; your NVWAL file has the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL segment
+  size. The segment size is given with initdb --wal-segsize, or is 16MB by
+  default.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+  which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+  above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+  exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+  not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please be careful to decide
+  how large your NVWAL file is to be.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters nvwal_path and nvwal_size, corresponding to the two
+new options of initdb. If you run initdb as above, you will find postgresql.conf
+in your PGDATA directory, with settings as follows:
+
+  max_wal_size = 80GB
+  min_wal_size = 80GB
+  nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+  nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+  actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+  forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres could possibly run even if the three values are
+  not the same; however, we have not tested such a case yet.
+
+
+Startup
+-------
+Same as usual:
+
+  $ pg_ctl start
+
+or use numactl as follows to let postgres run on the specified NUMA node
+(typically the one on which your NVWAL file is) if you need stable performance:
+
+  $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
-- 
2.20.1

#2Fabien COELHO
coelho@cri.ensmp.fr
In reply to: Takashi Menjo (#1)
Re: [PoC] Non-volatile WAL buffer

Hello,

+1 on the idea.

By quickly looking at the patch, I notice that there are no tests.

Is it possible to emulate something without the actual hardware, at least
for testing purposes?

--
Fabien.

#3Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Takashi Menjo (#1)
Re: [PoC] Non-volatile WAL buffer

On 24/01/2020 10:06, Takashi Menjo wrote:

I propose "non-volatile WAL buffer," a proof-of-concept new feature. It
enables WAL records to be durable without output to WAL segment files by
residing on persistent memory (PMEM) instead of DRAM. It improves database
performance by reducing copies of WAL and shortening the time of write
transactions.

I attach the first patchset that can be applied to PostgreSQL 12.0 (refs/
tags/REL_12_0). Please see README.nvwal (added by the patch 0003) to use
the new feature.

I have the same comments on this that I had on the previous patch, see:

/messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi

- Heikki

#4Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Fabien COELHO (#2)
RE: [PoC] Non-volatile WAL buffer

Hello Fabien,

Thank you for your +1 :)

Is it possible to emulate something without the actual hardware, at least
for testing purposes?

Yes, you can emulate PMEM using DRAM on Linux, via the "memmap=nnG!ssG" kernel
parameter. Please see [1] and [2] for emulation details. If your emulation
does not work well, please check that the kernel configuration options (like
CONFIG_FOOBAR) for PMEM and DAX (in [1] and [3]) are set up properly.

Best regards,
Takashi

[1]: How to Emulate Persistent Memory Using Dynamic Random-access Memory (DRAM)
https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
[2]: How to choose the correct memmap kernel parameter for PMEM on your system
https://nvdimm.wiki.kernel.org/how_to_choose_the_correct_memmap_kernel_parameter_for_pmem_on_your_system
[3]: Persistent Memory Wiki
https://nvdimm.wiki.kernel.org/

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#5Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Heikki Linnakangas (#3)
RE: [PoC] Non-volatile WAL buffer

Hello Heikki,

I have the same comments on this that I had on the previous patch, see:

/messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi

Thanks. I re-read your messages [1][2]. What you suggested, AFAIU, is to use
memory-mapped WAL segment files as WAL buffers, and to switch between CPU
instructions and msync(), depending on whether the segment files are on
PMEM or not, to sync inserted WAL records.

It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.

You also mentioned a SIGBUS problem with memory-mapped I/O. I think it's true
for reading from bad memory blocks, as you mentioned, and also true for writing
to such blocks [3]. Handling SIGBUS properly or working around it is future
work.
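
To make the memory-mapped alternative concrete, here is a minimal sketch of the non-PMEM path as understood above: map a segment file with MAP_SHARED, copy a record into it, and msync() the touched pages. It is not taken from any patch in this thread. The function name mapped_segment_write is hypothetical, and mapping and unmapping on every call is done only to keep the example short. On PMEM the msync() call would presumably be replaced by CPU cache-flush instructions, for example via libpmem's pmem_persist(), which this sketch does not use.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy len bytes of rec to offset off of a memory-mapped segment file of
 * seg_size bytes and make them durable with msync().
 * Returns 0 on success, -1 on failure.
 */
int
mapped_segment_write(const char *path, size_t seg_size,
					 size_t off, const void *rec, size_t len)
{
	size_t		pagesz = (size_t) sysconf(_SC_PAGESIZE);
	size_t		sync_start;
	char	   *base;
	int			fd;
	int			rc;

	assert(path != NULL && rec != NULL);

	/* The record must fit inside the segment */
	if (off > seg_size || len > seg_size - off)
		return -1;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	/* Make sure the file is large enough to back the whole mapping */
	if (ftruncate(fd, (off_t) seg_size) != 0)
	{
		close(fd);
		return -1;
	}

	base = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close() */
	if (base == MAP_FAILED)
		return -1;

	/* "Insert" the record directly into the mapped segment */
	memcpy(base + off, rec, len);

	/* msync() needs a page-aligned start address, so round down */
	sync_start = off & ~(pagesz - 1);
	rc = msync(base + sync_start, (off - sync_start) + len, MS_SYNC);

	munmap(base, seg_size);
	return rc;
}
```

A real implementation would keep the mapping for the lifetime of the segment, which is exactly where the per-segment mmap()/munmap() overhead mentioned above comes in.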

Best regards,
Takashi

[1]: /messages/by-id/83eafbfd-d9c5-6623-2423-7cab1be3888c@iki.fi
[2]: /messages/by-id/2aec6e2a-6a32-0c39-e4e2-aad854543aa8@iki.fi
[3]: https://pmem.io/2018/11/26/bad-blocks.htm

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#6Robert Haas
robertmhaas@gmail.com
In reply to: Takashi Menjo (#5)
Re: [PoC] Non-volatile WAL buffer

On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.

I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than read(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.
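
The trade-off in the paragraph above comes down to simple arithmetic. The following sketch only counts calls; it assumes one write() per block and ignores page faults, msync() and write-back, all of which matter in practice. The helper names are hypothetical, and the per-call costs are inputs that would have to be measured, since no figures are given here.

```c
#include <assert.h>
#include <stddef.h>

/* Number of block-sized write() calls needed to cover one WAL segment. */
static size_t
writes_per_segment(size_t segment_bytes, size_t block_bytes)
{
	assert(block_bytes > 0);
	return segment_bytes / block_bytes;
}

/*
 * Nonzero if mapping wins on call overhead alone: one mmap()+munmap() pair
 * versus writes_per_segment() write() calls.  Costs are in any common unit
 * (for example nanoseconds) and are assumptions supplied by the caller.
 */
static int
mmap_wins_on_call_overhead(double mmap_pair_cost, double write_cost,
						   size_t segment_bytes, size_t block_bytes)
{
	return mmap_pair_cost <
		write_cost * (double) writes_per_segment(segment_bytes, block_bytes);
}
```

With the default sizes, writes_per_segment(16 * 1024 * 1024, 8192) is 2048, so on call overhead alone a mmap()/munmap() pair could cost up to 2048 times as much as a single write() and still break even.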

I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.

A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Robert Haas (#6)
RE: [PoC] Non-volatile WAL buffer

Hello Robert,

I think our concerns are roughly classified into two:

(1) Performance
(2) Consistency

And your "different concern" is rather into (2), I think.

I'm also worried about it, but I have no good answer for now. I suppose mmap(flags|=MAP_SHARED) called by multiple backend processes for the same file works consistently for both PMEM and non-PMEM devices. However, I have not found any evidence such as specification documents yet.

I also made a tiny program calling memcpy() and msync() on the same mmap()-ed file but mutually distinct address range in parallel, and found that there was no corrupted data. However, that result does not ensure any consistency I'm worried about.  I could give it up if there *were* corrupted data...
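
The tiny program itself is not attached, so the following is only a guess at its shape, written as a sketch: several processes each map the same file with MAP_SHARED, fill a mutually distinct page-aligned range, and msync() it, and the parent then counts bytes that do not hold the expected value. It uses memset() instead of memcpy() for brevity, and the names fill_range and parallel_mapped_write_check are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_WRITERS 16

/* Child side: map the file independently, fill one range, msync that range */
static int
fill_range(const char *path, size_t total, size_t off, size_t len, int value)
{
	int			fd = open(path, O_RDWR);
	char	   *base;
	int			rc;

	if (fd < 0)
		return -1;
	base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (base == MAP_FAILED)
		return -1;

	memset(base + off, value, len);
	rc = msync(base + off, len, MS_SYNC);	/* off is page-aligned */
	munmap(base, total);
	return rc;
}

/* Returns the number of bytes with an unexpected value, or -1 on error */
long
parallel_mapped_write_check(const char *path, int nprocs, size_t pages_each)
{
	pid_t		pids[MAX_WRITERS];
	size_t		chunk, total, j;
	long		bad = 0;
	int			started = 0, failed = 0, fd, i;
	char	   *base;

	assert(path != NULL);
	if (nprocs < 1 || nprocs > MAX_WRITERS || pages_each == 0)
		return -1;
	chunk = pages_each * (size_t) sysconf(_SC_PAGESIZE);
	total = chunk * (size_t) nprocs;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) total) != 0)
	{
		close(fd);
		return -1;
	}

	/* Start one writer process per range */
	for (i = 0; i < nprocs; i++)
	{
		pid_t		pid = fork();

		if (pid == 0)
			_exit(fill_range(path, total, (size_t) i * chunk, chunk,
							 'A' + i) == 0 ? 0 : 1);
		if (pid < 0)
		{
			failed = 1;
			break;
		}
		pids[started++] = pid;
	}

	/* Wait for all writers that were started */
	for (i = 0; i < started; i++)
	{
		int			status;

		if (waitpid(pids[i], &status, 0) < 0 ||
			!WIFEXITED(status) || WEXITSTATUS(status) != 0)
			failed = 1;
	}

	base = mmap(NULL, total, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (base == MAP_FAILED)
		return -1;

	/* Every byte of range i must hold the value written by writer i */
	for (i = 0; i < nprocs && !failed; i++)
		for (j = 0; j < chunk; j++)
			if (base[(size_t) i * chunk + j] != (char) ('A' + i))
				bad++;

	munmap(base, total);
	return failed ? -1 : bad;
}
```

As noted above, a result of zero from such a test shows only that no corruption was observed in that run; it does not establish the consistency guarantee in question.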

So I will go to (1) first. I will test the way Heikki told us to answer whether the cost of mmap() and munmap() per WAL segment, etc, is reasonable or not. If it really is, then I will go to (2).

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#8Robert Haas
robertmhaas@gmail.com
In reply to: Takashi Menjo (#7)
Re: [PoC] Non-volatile WAL buffer

On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

I think our concerns are roughly classified into two:

(1) Performance
(2) Consistency

And your "different concern" is rather into (2), I think.

Actually, I think it was mostly a performance concern (writes
triggering lots of reading) but there might be a consistency issue as
well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#6)
Re: [PoC] Non-volatile WAL buffer

Hi,

On 2020-01-27 13:54:38 -0500, Robert Haas wrote:

On Mon, Jan 27, 2020 at 2:01 AM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

It sounds reasonable, but I'm sorry that I haven't tested such a program
yet. I'll try it to compare with my non-volatile WAL buffer. For now, I'm
a little worried about the overhead of mmap()/munmap() for each WAL segment
file.

I guess the question here is how the cost of one mmap() and munmap()
pair per WAL segment (normally 16MB) compares to the cost of one
write() per block (normally 8kB). It could be that mmap() is a more
expensive call than read(), but by a small enough margin that the
vastly reduced number of system calls makes it a winner. But that's
just speculation, because I don't know how heavy mmap() actually is.

mmap()/munmap() on a regular basis does have pretty bad scalability
impacts. I don't think they'd fully hit us, because we're not in a
threaded world however.

My issue with the proposal to go towards mmap()/munmap() is that I think
doing so forecloses a lot of improvements. Even today, on fast storage,
using open_datasync is faster (at least when somehow hitting the
O_DIRECT path, which isn't that easy these days) - and that's despite it
being really unoptimized. I think our WAL scalability is a serious
issue. There's a fair bit that we can improve by just fixing things without
really changing the way we do IO:

- Split WALWriteLock into one lock for writing and one for flushing the
WAL. Right now we prevent other sessions from writing out WAL - even
to other segments - when one session is doing a WAL flush. But there's
absolutely no need for that.
- Stop increasing the size of the flush request to the max when flushing
WAL (cf "try to write/flush later additions to XLOG as well" in
XLogFlush()) - that currently reduces throughput in OLTP workloads
quite noticeably. It made some sense in the spinning disk times, but I
don't think it does for a halfway decent SSD. By writing the maximum
ready to write, we hold the lock for longer, increasing latency for
the committing transaction *and* preventing more WAL from being written.
- We should immediately ask the OS to flush writes for full XLOG pages
back to the OS. Right now the IO for that will never be started before
the commit comes around in an OLTP workload, which means that we just
waste the time between the XLogWrite() and the commit.

That'll gain us 2-3x, I think. But after that I think we're going to
have to actually change more fundamentally how we do IO for WAL
writes. Using async IO I can do like 18k individual durable 8kb writes
(using O_DSYNC) a second, at a queue depth of 32. On my laptop. If I
make it 4k writes, it's 22k.
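
For readers who want to get a feel for such numbers, here is a rough sketch of a durable-write measurement. It is deliberately simpler than what is described above: it issues synchronous pwrite() calls on a file opened with O_DSYNC, one block at a time, so it does not use async IO or a queue depth of 32 and will not reproduce those figures. The function name dsync_write_rate is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Write nblocks separate blocks of the given size with O_DSYNC, so that each
 * pwrite() returns only after its data is durable.
 * Returns the observed writes per second, or -1.0 on error.
 */
double
dsync_write_rate(const char *path, size_t block, int nblocks)
{
	struct timespec t0, t1;
	double		secs;
	char	   *buf;
	int			fd;
	int			i;

	assert(path != NULL);
	if (block == 0 || nblocks <= 0)
		return -1.0;

	buf = malloc(block);
	if (buf == NULL)
		return -1.0;
	memset(buf, 'x', block);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC, 0600);
	if (fd < 0)
	{
		free(buf);
		return -1.0;
	}

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < nblocks; i++)
	{
		/* Each block goes to its own offset; nothing is overwritten */
		if (pwrite(fd, buf, block, (off_t) i * (off_t) block) != (ssize_t) block)
		{
			close(fd);
			free(buf);
			return -1.0;
		}
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	close(fd);
	free(buf);

	secs = (double) (t1.tv_sec - t0.tv_sec) +
		(double) (t1.tv_nsec - t0.tv_nsec) / 1e9;
	if (secs <= 0.0)
		secs = 1e-9;			/* guard against a zero interval */
	return (double) nblocks / secs;
}
```

Calling it with block sizes of 8192 and 4096 gives a crude single-threaded baseline; the result depends heavily on the filesystem and device underneath the path.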

That's not directly comparable with postgres WAL flushes, of course, as
it's all separate blocks, whereas WAL will often end up overwriting the
last block. But it doesn't at all account for group commits either,
which we *constantly* end up doing.

Postgres manages somewhere between ~450 (multiple users) and ~800 (single
user) individually durable WAL writes / sec on the same hardware. Yes,
that's more than an order of magnitude less. Of course some of that is
just that postgres does more than just IO - but that's not an effect on the
order of a magnitude.

So, why am I bringing this up in this thread? Only because I do not see
a way to actually utilize non-pmem hardware to a much higher degree than
we are doing now by using mmap(). Doing so requires using direct IO,
which is fundamentally incompatible with using mmap().

I have a different concern. I think that, right now, when we reuse a
WAL segment, we write entire blocks at a time, so the old contents of
the WAL segment are overwritten without ever being read. But that
behavior might not be maintained when using mmap(). It might be that
as soon as we write the first byte to a mapped page, the old contents
have to be faulted into memory. Indeed, it's unclear how it could be
otherwise, since the VM page must be made read-write at that point and
the system cannot know that we will overwrite the whole page. But
reading in the old contents of a recycled WAL file just to overwrite
them seems like it would be disastrously expensive.

Yea, that's a serious concern.

A related, but more minor, concern is whether there are any
differences in the write-back behavior when modifying a mapped
region vs. using write(). Either way, the same pages of the same file
will get dirtied, but the kernel might not have the same idea in
either case about when the changed pages should be written back down
to disk, and that could make a big difference to performance.

I don't think there's a significant difference in the case of Linux - no
idea about others. And either way we probably should force the kernel's
hand to start flushing much sooner.
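
One way to nudge the kernel into starting write-back early on Linux is sync_file_range() with SYNC_FILE_RANGE_WRITE. The sketch below is a minimal illustration and not taken from any patch in this thread: `write_and_start_writeback()` and `open_scratch_file()` are hypothetical helpers, the call is Linux-specific, and it gives no durability guarantee by itself, so an fsync() or fdatasync() is still needed before a commit is acknowledged.

```c
#define _GNU_SOURCE				/* for sync_file_range() on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write `len` bytes at `offset`, then ask the kernel to begin writing the
 * dirtied range back without waiting for it to complete.
 * Returns 0 on success, -1 on error.
 */
int
write_and_start_writeback(int fd, const char *buf, size_t len, off_t offset)
{
	assert(buf != NULL);
	if (pwrite(fd, buf, len, offset) != (ssize_t) len)
		return -1;
#ifdef __linux__
	/* Initiate write-back of the range; does not wait and is not durable. */
	return sync_file_range(fd, offset, (off_t) len, SYNC_FILE_RANGE_WRITE);
#else
	/* No equivalent used here; rely on the later fsync()/fdatasync(). */
	return 0;
#endif
}

/* Create an unlinked scratch file; returns its descriptor or -1. */
int
open_scratch_file(void)
{
	char		path[] = "/tmp/wbdemo-XXXXXX";
	int			fd = mkstemp(path);

	if (fd >= 0)
		unlink(path);
	return fd;
}
```

A caller would still finish with `fdatasync(fd)`; the hint only spreads the write-back work out so that the final sync has less left to do.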

Greetings,

Andres Freund

#10Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Robert Haas (#8)
5 attachment(s)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I made another WIP patchset that mmaps WAL segments and uses them as WAL buffers. Note that this is not a non-volatile WAL buffer patchset but its competitor. I am measuring and analyzing the performance of this patchset to compare it with my non-volatile WAL buffer.

Please wait several more days for the result report...
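
The addressing that this approach relies on can be sketched in a few lines. Below, an LSN is split into a segment number and a byte offset into that segment's mapping, and a flush request is clamped to one segment at a time, loosely following what patch 0002 does in GetXLogBuffer() and XLogWrite(). The type name, function names, and the fixed 16 MiB segment size are assumptions for illustration only; the patch itself uses XLByteToSeg(), XLByteToPrevSeg(), and wal_segment_size.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t Lsn;			/* stand-in for XLogRecPtr */

#define DEMO_SEG_SIZE ((Lsn) 16 * 1024 * 1024)	/* assumed segment size */

/* Number of the segment that contains the byte at `lsn`. */
uint64_t
lsn_segno(Lsn lsn)
{
	return lsn / DEMO_SEG_SIZE;
}

/* Byte offset of `lsn` within its segment's mapping. */
uint64_t
lsn_offset(Lsn lsn)
{
	return lsn % DEMO_SEG_SIZE;
}

/*
 * Clamp the range [flushed, request) to segment `segno`, store the clamped
 * bounds in *begin and *end, and return the number of bytes to persist.
 *
 * The caller must pass a segment that overlaps the range, that is,
 * lsn_segno(flushed) <= segno <= lsn_segno(request - 1).
 */
uint64_t
flush_range_in_segment(uint64_t segno, Lsn flushed, Lsn request,
					   Lsn *begin, Lsn *end)
{
	Lsn			seg_begin = segno * DEMO_SEG_SIZE;
	Lsn			seg_end = seg_begin + DEMO_SEG_SIZE;

	*begin = (flushed > seg_begin) ? flushed : seg_begin;
	*end = (request < seg_end) ? request : seg_end;
	assert(*begin <= *end);
	return *end - *begin;
}
```

For example, a request that runs from 100 bytes before a segment boundary to 50 bytes past it yields two ranges: 100 bytes at the tail of the first segment and 50 bytes at the head of the next, each of which would be persisted through its own mapping.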

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Wednesday, January 29, 2020 6:00 AM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Heikki Linnakangas <hlinnaka@iki.fi>; pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

I think our concerns are roughly classified into two:

(1) Performance
(2) Consistency

And your "different concern" falls rather into (2), I think.

Actually, I think it was mostly a performance concern (writes triggering lots of reading) but there might be a
consistency issue as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company

Attachments:

0001-Preallocate-more-WAL-segments.patch (application/octet-stream)
From 72728138ef92b744b64464d21ba35d4b717a55bb Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:11 +0900
Subject: [msync 1/5] Preallocate more WAL segments

Please run ./configure with LIBS=-lpmem to build this patchset.
---
 src/backend/access/transam/xlog.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 77ad765989..e2cd34057f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -891,7 +891,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
-static void PreallocXlogFiles(XLogRecPtr endptr);
+static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
@@ -3801,27 +3801,20 @@ XLogFileClose(void)
 
 /*
  * Preallocate log files beyond the specified log endpoint.
- *
- * XXX this is currently extremely conservative, since it forces only one
- * future log segment to exist, and even that only if we are 75% done with
- * the current one.  This is only appropriate for very low-WAL-volume systems.
- * High-volume systems will be OK once they've built up a sufficient set of
- * recycled log segments, but the startup transient is likely to include
- * a lot of segment creations by foreground processes, which is not so good.
  */
 static void
-PreallocXlogFiles(XLogRecPtr endptr)
+PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
+	XLogSegNo	endSegNo;
+	XLogSegNo	recycleSegNo;
 	int			lf;
 	bool		use_existent;
-	uint64		offset;
 
-	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
-	offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
-	if (offset >= (uint32) (0.75 * wal_segment_size))
+	XLByteToPrevSeg(endptr, endSegNo, wal_segment_size);
+	recycleSegNo = XLOGfileslop(RedoRecPtr);
+	for (_logSegNo = endSegNo + 1; _logSegNo <= recycleSegNo; _logSegNo++)
 	{
-		_logSegNo++;
 		use_existent = true;
 		lf = XLogFileInit(_logSegNo, &use_existent, true);
 		close(lf);
@@ -7692,7 +7685,7 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	PreallocXlogFiles(RedoRecPtr, EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8905,7 +8898,7 @@ CreateCheckPoint(int flags)
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+		PreallocXlogFiles(RedoRecPtr, recptr);
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -9255,7 +9248,7 @@ CreateRestartPoint(int flags)
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
-	PreallocXlogFiles(endptr);
+	PreallocXlogFiles(RedoRecPtr, endptr);
 
 	/*
 	 * ThisTimeLineID is normally not set when we're still in recovery.
-- 
2.20.1

0002-Use-WAL-segments-as-WAL-buffers.patch (application/octet-stream)
From 76cda0eb7b660654b0aa379883ebe4952658cdf7 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:12 +0900
Subject: [msync 2/5] Use WAL segments as WAL buffers

Note that we ignore wal_sync_method from here.
---
 src/backend/access/transam/xlog.c | 833 +++++++++---------------------
 1 file changed, 243 insertions(+), 590 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e2cd34057f..43f9a8affc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -15,6 +15,7 @@
 #include "postgres.h"
 
 #include <ctype.h>
+#include <libpmem.h>
 #include <math.h>
 #include <time.h>
 #include <fcntl.h>
@@ -613,24 +614,8 @@ typedef struct XLogCtlData
 	XLogwrtResult LogwrtResult;
 
 	/*
-	 * Latest initialized page in the cache (last byte position + 1).
-	 *
-	 * To change the identity of a buffer (and InitializedUpTo), you need to
-	 * hold WALBufMappingLock.  To change the identity of a buffer that's
-	 * still dirty, the old page needs to be written out first, and for that
-	 * you need WALWriteLock, and you need to ensure that there are no
-	 * in-progress insertions to the page by calling
-	 * WaitXLogInsertionsToFinish().
+	 * This value does not change after startup.
 	 */
-	XLogRecPtr	InitializedUpTo;
-
-	/*
-	 * These values do not change after startup, although the pointed-to pages
-	 * and xlblocks values certainly do.  xlblock values are protected by
-	 * WALBufMappingLock.
-	 */
-	char	   *pages;			/* buffers for unwritten XLOG pages */
-	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
 	/*
@@ -775,9 +760,16 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
  * openLogFile is -1 or a kernel FD for an open log file segment.
  * openLogSegNo identifies the segment.  These variables are only used to
  * write the XLOG, and so will normally refer to the active segment.
+ *
+ * mappedPages is mmap(2)-ed address for an open log file segment.
+ * It is used as WAL buffer instead of XLogCtl->pages.
+ *
+ * pmemMapped is true if mappedPages is on PMEM.
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static char *mappedPages = NULL;
+static bool pmemMapped = 0;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -875,12 +867,12 @@ static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
-static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 								   bool find_free, XLogSegNo max_segno,
 								   bool use_lock);
+static char *XLogFileMap(XLogSegNo segno, bool *is_pmem);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -891,6 +883,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
+static void XLogFileUnmap(char *pages, XLogSegNo segno);
 static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
@@ -940,7 +933,6 @@ static void checkXLogConsistency(XLogReaderState *record);
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
-static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -1574,27 +1566,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		 */
 		while (CurrPos < EndPos)
 		{
-			/*
-			 * The minimal action to flush the page would be to call
-			 * WALInsertLockUpdateInsertingAt(CurrPos) followed by
-			 * AdvanceXLInsertBuffer(...).  The page would be left initialized
-			 * mostly to zeros, except for the page header (always the short
-			 * variant, as this is never a segment's first page).
-			 *
-			 * The large vistas of zeros are good for compressibility, but the
-			 * headers interrupting them every XLOG_BLCKSZ (with values that
-			 * differ from page to page) are not.  The effect varies with
-			 * compression tool, but bzip2 for instance compresses about an
-			 * order of magnitude worse if those headers are left in place.
-			 *
-			 * Rather than complicating AdvanceXLInsertBuffer itself (which is
-			 * called in heavily-loaded circumstances as well as this lightly-
-			 * loaded one) with variant behavior, we just use GetXLogBuffer
-			 * (which itself calls the two methods we need) to get the pointer
-			 * and zero most of the page.  Then we just zero the page header.
-			 */
-			currpos = GetXLogBuffer(CurrPos);
-			MemSet(currpos, 0, SizeOfXLogShortPHD);
+			/* XXX We assume that XLogFileInit does what we did here */
 
 			CurrPos += XLOG_BLCKSZ;
 		}
@@ -1708,29 +1680,6 @@ WALInsertLockRelease(void)
 	}
 }
 
-/*
- * Update our insertingAt value, to let others know that we've finished
- * inserting up to that point.
- */
-static void
-WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
-{
-	if (holdingAllLocks)
-	{
-		/*
-		 * We use the last lock to mark our actual position, see comments in
-		 * WALInsertLockAcquireExclusive.
-		 */
-		LWLockUpdateVar(&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.lock,
-						&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.insertingAt,
-						insertingAt);
-	}
-	else
-		LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
-						&WALInsertLocks[MyLockNo].l.insertingAt,
-						insertingAt);
-}
-
 /*
  * Wait for any WAL insertions < upto to finish.
  *
@@ -1831,123 +1780,37 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 /*
  * Get a pointer to the right location in the WAL buffer containing the
  * given XLogRecPtr.
- *
- * If the page is not initialized yet, it is initialized. That might require
- * evicting an old dirty buffer from the buffer cache, which means I/O.
- *
- * The caller must ensure that the page containing the requested location
- * isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto a WAL insertion lock with the insertingAt position set to
- * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
- * to evict an old page from the buffer. (This means that once you call
- * GetXLogBuffer() with a given 'ptr', you must not access anything before
- * that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
- * later, because older buffers might be recycled already)
  */
 static char *
 GetXLogBuffer(XLogRecPtr ptr)
 {
-	int			idx;
-	XLogRecPtr	endptr;
-	static uint64 cachedPage = 0;
-	static char *cachedPos = NULL;
-	XLogRecPtr	expectedEndPtr;
+	int				idx;
+	XLogPageHeader	page;
+	XLogSegNo		segno;
 
-	/*
-	 * Fast path for the common case that we need to access again the same
-	 * page as last time.
-	 */
-	if (ptr / XLOG_BLCKSZ == cachedPage)
+	/* shut-up compiler if not --enable-cassert */
+	(void) page;
+
+	XLByteToSeg(ptr, segno, wal_segment_size);
+	if (segno != openLogSegNo)
 	{
-		Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
-		Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-		return cachedPos + ptr % XLOG_BLCKSZ;
+		/* Unmap the current segment if mapped */
+		if (mappedPages != NULL)
+			XLogFileUnmap(mappedPages, openLogSegNo);
+
+		/* Map the segment we need */
+		mappedPages = XLogFileMap(segno, &pmemMapped);
+		Assert(mappedPages != NULL);
+		openLogSegNo = segno;
 	}
 
-	/*
-	 * The XLog buffer cache is organized so that a page is always loaded to a
-	 * particular buffer.  That way we can easily calculate the buffer a given
-	 * page must be loaded into, from the XLogRecPtr alone.
-	 */
 	idx = XLogRecPtrToBufIdx(ptr);
+	page = (XLogPageHeader) (mappedPages + idx * (Size) XLOG_BLCKSZ);
 
-	/*
-	 * See what page is loaded in the buffer at the moment. It could be the
-	 * page we're looking for, or something older. It can't be anything newer
-	 * - that would imply the page we're looking for has already been written
-	 * out to disk and evicted, and the caller is responsible for making sure
-	 * that doesn't happen.
-	 *
-	 * However, we don't hold a lock while we read the value. If someone has
-	 * just initialized the page, it's possible that we get a "torn read" of
-	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
-	 * that case we will see a bogus value. That's ok, we'll grab the mapping
-	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
-	 * the page we're looking for. But it means that when we do this unlocked
-	 * read, we might see a value that appears to be ahead of the page we're
-	 * looking for. Don't PANIC on that, until we've verified the value while
-	 * holding the lock.
-	 */
-	expectedEndPtr = ptr;
-	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+	Assert(page->xlp_magic == XLOG_PAGE_MAGIC);
+	Assert(page->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
 
-	endptr = XLogCtl->xlblocks[idx];
-	if (expectedEndPtr != endptr)
-	{
-		XLogRecPtr	initializedUpto;
-
-		/*
-		 * Before calling AdvanceXLInsertBuffer(), which can block, let others
-		 * know how far we're finished with inserting the record.
-		 *
-		 * NB: If 'ptr' points to just after the page header, advertise a
-		 * position at the beginning of the page rather than 'ptr' itself. If
-		 * there are no other insertions running, someone might try to flush
-		 * up to our advertised location. If we advertised a position after
-		 * the page header, someone might try to flush the page header, even
-		 * though page might actually not be initialized yet. As the first
-		 * inserter on the page, we are effectively responsible for making
-		 * sure that it's initialized, before we let insertingAt to move past
-		 * the page header.
-		 */
-		if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
-			XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogShortPHD;
-		else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
-				 XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogLongPHD;
-		else
-			initializedUpto = ptr;
-
-		WALInsertLockUpdateInsertingAt(initializedUpto);
-
-		AdvanceXLInsertBuffer(ptr, false);
-		endptr = XLogCtl->xlblocks[idx];
-
-		if (expectedEndPtr != endptr)
-			elog(PANIC, "could not find WAL buffer for %X/%X",
-				 (uint32) (ptr >> 32), (uint32) ptr);
-	}
-	else
-	{
-		/*
-		 * Make sure the initialization of the page is visible to us, and
-		 * won't arrive later to overwrite the WAL data we write on the page.
-		 */
-		pg_memory_barrier();
-	}
-
-	/*
-	 * Found the buffer holding this page. Return a pointer to the right
-	 * offset within the page.
-	 */
-	cachedPage = ptr / XLOG_BLCKSZ;
-	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
-
-	Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
-	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-
-	return cachedPos + ptr % XLOG_BLCKSZ;
+	return mappedPages + ptr % wal_segment_size;
 }
 
 /*
@@ -2075,178 +1938,6 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
 	return result;
 }
 
-/*
- * Initialize XLOG buffers, writing out old buffers if they still contain
- * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
- * true, initialize as many pages as we can without having to write out
- * unwritten data. Any new pages are initialized to zeros, with pages headers
- * initialized properly.
- */
-static void
-AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
-{
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	int			nextidx;
-	XLogRecPtr	OldPageRqstPtr;
-	XLogwrtRqst WriteRqst;
-	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
-	XLogRecPtr	NewPageBeginPtr;
-	XLogPageHeader NewPage;
-	int			npages = 0;
-
-	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
-	/*
-	 * Now that we have the lock, check if someone initialized the page
-	 * already.
-	 */
-	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
-	{
-		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
-
-		/*
-		 * Get ending-offset of the buffer page we need to replace (this may
-		 * be zero if the buffer hasn't been used yet).  Fall through if it's
-		 * already written out.
-		 */
-		OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
-		if (LogwrtResult.Write < OldPageRqstPtr)
-		{
-			/*
-			 * Nope, got work to do. If we just want to pre-initialize as much
-			 * as we can without flushing, give up now.
-			 */
-			if (opportunistic)
-				break;
-
-			/* Before waiting, get info_lck and update LogwrtResult */
-			SpinLockAcquire(&XLogCtl->info_lck);
-			if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
-				XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
-			LogwrtResult = XLogCtl->LogwrtResult;
-			SpinLockRelease(&XLogCtl->info_lck);
-
-			/*
-			 * Now that we have an up-to-date LogwrtResult value, see if we
-			 * still need to write it or if someone else already did.
-			 */
-			if (LogwrtResult.Write < OldPageRqstPtr)
-			{
-				/*
-				 * Must acquire write lock. Release WALBufMappingLock first,
-				 * to make sure that all insertions that we need to wait for
-				 * can finish (up to this same position). Otherwise we risk
-				 * deadlock.
-				 */
-				LWLockRelease(WALBufMappingLock);
-
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
-
-				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
-				LogwrtResult = XLogCtl->LogwrtResult;
-				if (LogwrtResult.Write >= OldPageRqstPtr)
-				{
-					/* OK, someone wrote it already */
-					LWLockRelease(WALWriteLock);
-				}
-				else
-				{
-					/* Have to write it ourselves */
-					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
-					WriteRqst.Flush = 0;
-					XLogWrite(WriteRqst, false);
-					LWLockRelease(WALWriteLock);
-					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
-				}
-				/* Re-acquire WALBufMappingLock and retry */
-				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-				continue;
-			}
-		}
-
-		/*
-		 * Now the next buffer slot is free and we can set it up to be the
-		 * next output page.
-		 */
-		NewPageBeginPtr = XLogCtl->InitializedUpTo;
-		NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-
-		Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
-
-		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
-
-		/*
-		 * Be sure to re-zero the buffer so that bytes beyond what we've
-		 * written will look like zeroes and not valid XLOG records...
-		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
-
-		/*
-		 * Fill the new page's header
-		 */
-		NewPage->xlp_magic = XLOG_PAGE_MAGIC;
-
-		/* NewPage->xlp_info = 0; */	/* done by memset */
-		NewPage->xlp_tli = ThisTimeLineID;
-		NewPage->xlp_pageaddr = NewPageBeginPtr;
-
-		/* NewPage->xlp_rem_len = 0; */	/* done by memset */
-
-		/*
-		 * If online backup is not in progress, mark the header to indicate
-		 * that WAL records beginning in this page have removable backup
-		 * blocks.  This allows the WAL archiver to know whether it is safe to
-		 * compress archived WAL data by transforming full-block records into
-		 * the non-full-block format.  It is sufficient to record this at the
-		 * page level because we force a page switch (in fact a segment
-		 * switch) when starting a backup, so the flag will be off before any
-		 * records can be written during the backup.  At the end of a backup,
-		 * the last page will be marked as all unsafe when perhaps only part
-		 * is unsafe, but at worst the archiver would miss the opportunity to
-		 * compress a few records.
-		 */
-		if (!Insert->forcePageWrites)
-			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
-
-		/*
-		 * If first page of an XLOG segment file, make it a long header.
-		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
-		{
-			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
-
-			NewLongPage->xlp_sysid = ControlFile->system_identifier;
-			NewLongPage->xlp_seg_size = wal_segment_size;
-			NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
-			NewPage->xlp_info |= XLP_LONG_HEADER;
-		}
-
-		/*
-		 * Make sure the initialization of the page becomes visible to others
-		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
-		 * holding a lock.
-		 */
-		pg_write_barrier();
-
-		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
-		XLogCtl->InitializedUpTo = NewPageEndPtr;
-
-		npages++;
-	}
-	LWLockRelease(WALBufMappingLock);
-
-#ifdef WAL_DEBUG
-	if (XLOG_DEBUG && npages > 0)
-	{
-		elog(DEBUG1, "initialized %d pages, up to %X/%X",
-			 npages, (uint32) (NewPageEndPtr >> 32), (uint32) NewPageEndPtr);
-	}
-#endif
-}
-
 /*
  * Calculate CheckPointSegments based on max_wal_size_mb and
  * checkpoint_completion_target.
@@ -2375,14 +2066,9 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 static void
 XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 {
-	bool		ispartialpage;
-	bool		last_iteration;
 	bool		finishing_seg;
-	bool		use_existent;
-	int			curridx;
-	int			npages;
-	int			startidx;
-	uint32		startoffset;
+	XLogSegNo	rqstLogSegNo;
+	XLogSegNo	segno;
 
 	/* We should always be inside a critical section here */
 	Assert(CritSectionCount > 0);
@@ -2392,223 +2078,140 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	 */
 	LogwrtResult = XLogCtl->LogwrtResult;
 
-	/*
-	 * Since successive pages in the xlog cache are consecutively allocated,
-	 * we can usually gather multiple pages together and issue just one
-	 * write() call.  npages is the number of pages we have determined can be
-	 * written together; startidx is the cache block index of the first one,
-	 * and startoffset is the file offset at which it should go. The latter
-	 * two variables are only valid when npages > 0, but we must initialize
-	 * all of them to keep the compiler quiet.
-	 */
-	npages = 0;
-	startidx = 0;
-	startoffset = 0;
+	/* Fast return if not requested to flush */
+	if (WriteRqst.Flush == 0)
+		return;
+	Assert(WriteRqst.Flush == WriteRqst.Write);
 
 	/*
-	 * Within the loop, curridx is the cache block index of the page to
-	 * consider writing.  Begin at the buffer containing the next unwritten
-	 * page, or last partially written page.
+	 * Call pmem_persist() or pmem_msync() for each segment file that contains
+	 * records to be flushed.
 	 */
-	curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);
-
-	while (LogwrtResult.Write < WriteRqst.Write)
+	XLByteToPrevSeg(WriteRqst.Flush, rqstLogSegNo, wal_segment_size);
+	XLByteToSeg(LogwrtResult.Flush, segno, wal_segment_size);
+	while (segno <= rqstLogSegNo)
 	{
-		/*
-		 * Make sure we're not ahead of the insert process.  This could happen
-		 * if we're passed a bogus WriteRqst.Write that is past the end of the
-		 * last page that's been initialized by AdvanceXLInsertBuffer.
-		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		bool		is_pmem;
+		char	   *addr;
+		char	   *p;
+		Size		len;
+		XLogRecPtr	BeginPtr;
+		XLogRecPtr	EndPtr;
 
-		if (LogwrtResult.Write >= EndPtr)
-			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
-				 (uint32) (LogwrtResult.Write >> 32),
-				 (uint32) LogwrtResult.Write,
-				 (uint32) (EndPtr >> 32), (uint32) EndPtr);
-
-		/* Advance LogwrtResult.Write to end of current buffer page */
-		LogwrtResult.Write = EndPtr;
-		ispartialpage = WriteRqst.Write < LogwrtResult.Write;
-
-		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
-							 wal_segment_size))
+		/* Check if the segment is not mapped yet */
+		if (segno != openLogSegNo)
 		{
+			/* Map newly */
+			is_pmem = 0;
+			addr = XLogFileMap(segno, &is_pmem);
+
 			/*
-			 * Switch to new logfile segment.  We cannot have any pending
-			 * pages here (since we dump what we have at segment end).
+			 * Use the mapped above as WAL buffer of this process for the
+			 * future.  Note that it might be unmapped within this loop.
 			 */
-			Assert(npages == 0);
-			if (openLogFile >= 0)
-				XLogFileClose();
-			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-							wal_segment_size);
-
-			/* create/use new log file */
-			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			if (openLogSegNo == 0)
+			{
+				pmemMapped = is_pmem;
+				mappedPages = addr;
+				openLogSegNo = segno;
+			}
 		}
-
-		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		else
 		{
-			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
+			/* Or use existent mapping */
+			is_pmem = pmemMapped;
+			addr = mappedPages;
 		}
+		Assert(addr != NULL);
+		Assert(mappedPages != NULL);
+		Assert(openLogSegNo > 0);
 
-		/* Add current page to the set of pending pages-to-dump */
-		if (npages == 0)
-		{
-			/* first of group */
-			startidx = curridx;
-			startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
-											wal_segment_size);
-		}
-		npages++;
+		/* Find beginning position to be flushed */
+		BeginPtr = segno * wal_segment_size;
+		if (BeginPtr < LogwrtResult.Flush)
+			BeginPtr = LogwrtResult.Flush;
+
+		/* Find ending position to be flushed */
+		EndPtr = (segno + 1) * wal_segment_size;
+		if (EndPtr > WriteRqst.Flush)
+			EndPtr = WriteRqst.Flush;
+
+		/* Convert LSN to memory address */
+		Assert(BeginPtr <= EndPtr);
+		p = addr + BeginPtr % wal_segment_size;
+		len = (Size) (EndPtr - BeginPtr);
 
 		/*
-		 * Dump the set if this will be the last loop iteration, or if we are
-		 * at the last page of the cache area (since the next page won't be
-		 * contiguous in memory), or if we are at the end of the logfile
-		 * segment.
+		 * Do cache-flush or msync.
+		 *
+		 * Note that pmem_msync() does backoff to the page boundary.
 		 */
-		last_iteration = WriteRqst.Write <= LogwrtResult.Write;
-
-		finishing_seg = !ispartialpage &&
-			(startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;
-
-		if (last_iteration ||
-			curridx == XLogCtl->XLogCacheBlck ||
-			finishing_seg)
+		if (is_pmem)
 		{
-			char	   *from;
-			Size		nbytes;
-			Size		nleft;
-			int			written;
-
-			/* OK to write the page(s) */
-			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
-			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
+			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+			pmem_persist(p, len);
+			pgstat_report_wait_end();
+		}
+		else
+		{
+			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+			if (pmem_msync(p, len))
 			{
-				errno = 0;
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
-				if (written <= 0)
-				{
-					if (errno == EINTR)
-						continue;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									XLogFileNameP(ThisTimeLineID, openLogSegNo),
-									startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not msync file \"%s\": %m",
+								XLogFileNameP(ThisTimeLineID, segno))));
+			}
+			pgstat_report_wait_end();
+		}
+		LogwrtResult.Flush = LogwrtResult.Write = EndPtr;
 
-			npages = 0;
+		/* Check if whole my WAL buffers are synchronized to the segment */
+		finishing_seg = (LogwrtResult.Flush % wal_segment_size == 0) &&
+						XLByteInPrevSeg(LogwrtResult.Flush, openLogSegNo,
+										wal_segment_size);
 
-			/*
-			 * If we just wrote the whole last page of a logfile segment,
-			 * fsync the segment immediately.  This avoids having to go back
-			 * and re-open prior segments when an fsync request comes along
-			 * later. Doing it here ensures that one and only one backend will
-			 * perform this fsync.
-			 *
-			 * This is also the right place to notify the Archiver that the
-			 * segment is ready to copy to archival storage, and to update the
-			 * timer for archive_timeout, and to signal for a checkpoint if
-			 * too many logfile segments have been used since the last
-			 * checkpoint.
-			 */
+		if (segno != openLogSegNo || finishing_seg)
+		{
+			XLogFileUnmap(addr, segno);
 			if (finishing_seg)
 			{
-				issue_xlog_fsync(openLogFile, openLogSegNo);
-
-				/* signal that we need to wakeup walsenders later */
-				WalSndWakeupRequest();
-
-				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
-
-				if (XLogArchivingActive())
-					XLogArchiveNotifySeg(openLogSegNo);
-
-				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
-
-				/*
-				 * Request a checkpoint if we've consumed too much xlog since
-				 * the last one.  For speed, we first check using the local
-				 * copy of RedoRecPtr, which might be out of date; if it looks
-				 * like a checkpoint is needed, forcibly update RedoRecPtr and
-				 * recheck.
-				 */
-				if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
-				{
-					(void) GetRedoRecPtr();
-					if (XLogCheckpointNeeded(openLogSegNo))
-						RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
-				}
+				Assert(segno == openLogSegNo);
+				mappedPages = NULL;
+				openLogSegNo = 0;
 			}
-		}
 
-		if (ispartialpage)
-		{
-			/* Only asked to write a partial page */
-			LogwrtResult.Write = WriteRqst.Write;
-			break;
-		}
-		curridx = NextBufIdx(curridx);
+			/* signal that we need to wakeup walsenders later */
+			WalSndWakeupRequest();
 
-		/* If flexible, break out of loop as soon as we wrote something */
-		if (flexible && npages == 0)
-			break;
-	}
+			if (XLogArchivingActive())
+				XLogArchiveNotifySeg(segno);
 
-	Assert(npages == 0);
+			XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+			XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
-	/*
-	 * If asked to flush, do so
-	 */
-	if (LogwrtResult.Flush < WriteRqst.Flush &&
-		LogwrtResult.Flush < LogwrtResult.Write)
-
-	{
-		/*
-		 * Could get here without iterating above loop, in which case we might
-		 * have no open file or the wrong one.  However, we do not need to
-		 * fsync more than one file.
-		 */
-		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
-		{
-			if (openLogFile >= 0 &&
-				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
-								 wal_segment_size))
-				XLogFileClose();
-			if (openLogFile < 0)
+			/*
+			 * Request a checkpoint if we've consumed too much xlog since
+			 * the last one.  For speed, we first check using the local
+			 * copy of RedoRecPtr, which might be out of date; if it looks
+			 * like a checkpoint is needed, forcibly update RedoRecPtr and
+			 * recheck.
+			 */
+			if (IsUnderPostmaster && XLogCheckpointNeeded(segno))
 			{
-				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
+				(void) GetRedoRecPtr();
+				if (XLogCheckpointNeeded(segno))
+					RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 			}
-
-			issue_xlog_fsync(openLogFile, openLogSegNo);
 		}
 
-		/* signal that we need to wakeup walsenders later */
-		WalSndWakeupRequest();
-
-		LogwrtResult.Flush = LogwrtResult.Write;
+		++segno;
 	}
 
+	/* signal that we need to wakeup walsenders later */
+	WalSndWakeupRequest();
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -3029,6 +2632,16 @@ XLogBackgroundFlush(void)
 				XLogFileClose();
 			}
 		}
+		else if (mappedPages != NULL)
+		{
+			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
+								 wal_segment_size))
+			{
+				XLogFileUnmap(mappedPages, openLogSegNo);
+				mappedPages = NULL;
+				openLogSegNo = 0;
+			}
+		}
 		return false;
 	}
 
@@ -3095,12 +2708,6 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests();
 
-	/*
-	 * Great, done. To take some work off the critical path, try to initialize
-	 * as many of the no-longer-needed WAL buffers for future use as we can.
-	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
-
 	/*
 	 * If we determined that we need to write data, but somebody else
 	 * wrote/flushed already, it should be considered as being active, to
@@ -3257,6 +2864,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	save_errno = 0;
 	if (wal_init_zero)
 	{
+		XLogCtlInsert  *Insert = &XLogCtl->Insert;
+		XLogPageHeader	NewPage = (XLogPageHeader) zbuffer.data;
+		XLogRecPtr		NewPageBeginPtr = logsegno * wal_segment_size;
+
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
 		 * ensure that all the file space has really been allocated.  On
@@ -3268,6 +2879,48 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 */
 		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 		{
+			memset(NewPage, 0, SizeOfXLogLongPHD);
+
+			/*
+			 * Fill the new page's header
+			 */
+			NewPage->xlp_magic = XLOG_PAGE_MAGIC;
+
+			/* NewPage->xlp_info = 0; */	/* done by memset */
+			NewPage->xlp_tli = ThisTimeLineID;
+			NewPage->xlp_pageaddr = NewPageBeginPtr;
+
+			/* NewPage->xlp_rem_len = 0; */	/* done by memset */
+
+			/*
+			 * If online backup is not in progress, mark the header to indicate
+			 * that WAL records beginning in this page have removable backup
+			 * blocks.  This allows the WAL archiver to know whether it is safe to
+			 * compress archived WAL data by transforming full-block records into
+			 * the non-full-block format.  It is sufficient to record this at the
+			 * page level because we force a page switch (in fact a segment
+			 * switch) when starting a backup, so the flag will be off before any
+			 * records can be written during the backup.  At the end of a backup,
+			 * the last page will be marked as all unsafe when perhaps only part
+			 * is unsafe, but at worst the archiver would miss the opportunity to
+			 * compress a few records.
+			 */
+			if (!Insert->forcePageWrites)
+				NewPage->xlp_info |= XLP_BKP_REMOVABLE;
+
+			/*
+			 * If first page of an XLOG segment file, make it a long header.
+			 */
+			if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+			{
+				XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
+
+				NewLongPage->xlp_sysid = ControlFile->system_identifier;
+				NewLongPage->xlp_seg_size = wal_segment_size;
+				NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
+				NewPage->xlp_info |= XLP_LONG_HEADER;
+			}
+
 			errno = 0;
 			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 			{
@@ -3275,6 +2928,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 				save_errno = errno ? errno : ENOSPC;
 				break;
 			}
+
+			NewPageBeginPtr += XLOG_BLCKSZ;
 		}
 	}
 	else
@@ -3610,6 +3265,36 @@ XLogFileOpen(XLogSegNo segno)
 	return fd;
 }
 
+/*
+ * Memory-map a pre-existing logfile segment for WAL buffers.
+ *
+ * On success, returns a non-NULL address and sets *is_pmem to indicate whether
+ * the file is on PMEM.  Otherwise, it PANICs.
+ */
+static char *
+XLogFileMap(XLogSegNo segno, bool *is_pmem)
+{
+	char		path[MAXPGPATH];
+	char	   *addr;
+	Size		mlen;
+	int			pmem;
+
+	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
+
+	mlen = 0;
+	pmem = 0;
+	addr = pmem_map_file(path, 0, 0, 0, &mlen, &pmem);
+	if (addr == NULL)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not open or mmap file \"%s\": %m", path)));
+
+	Assert(mlen == wal_segment_size);
+
+	*is_pmem = (bool) pmem;
+	return addr;
+}
+
 /*
  * Open a logfile segment for reading (during recovery).
  *
@@ -3799,6 +3484,21 @@ XLogFileClose(void)
 	openLogFile = -1;
 }
 
+/*
+ * Unmap the current logfile segment for WAL buffer.
+ */
+static void
+XLogFileUnmap(char *pages, XLogSegNo segno)
+{
+	Assert(pages != NULL);
+
+	if (pmem_unmap(pages, wal_segment_size))
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not unmap file \"%s\": %m",
+						XLogFileNameP(ThisTimeLineID, segno))));
+}
+
 /*
  * Preallocate log files beyond the specified log endpoint.
  */
@@ -4947,12 +4647,6 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
-	/* xlblocks array */
-	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5028,10 +4722,6 @@ XLOGShmemInit(void)
 	 * needed here.
 	 */
 	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
-	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
-	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
-	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
-
 
 	/* WAL insertion locks. Ensure they're aligned to the full padded size */
 	allocptr += sizeof(WALInsertLockPadded) -
@@ -5048,15 +4738,6 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
-	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
-
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
 	 * in additional info.)
@@ -7494,40 +7175,12 @@ StartupXLOG(void)
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * We do not need the if-else block that once existed here, because we use
+	 * WAL segment files as WAL buffers, so the last block is "already on the
+	 * buffers."
+	 *
+	 * XXX We assume there is no torn record.
 	 */
-	if (EndOfLog % XLOG_BLCKSZ != 0)
-	{
-		char	   *page;
-		int			len;
-		int			firstIdx;
-		XLogRecPtr	pageBeginPtr;
-
-		pageBeginPtr = EndOfLog - (EndOfLog % XLOG_BLCKSZ);
-		Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
-
-		firstIdx = XLogRecPtrToBufIdx(EndOfLog);
-
-		/* Copy the valid part of the last block, and zero the rest */
-		page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
-		len = EndOfLog % XLOG_BLCKSZ;
-		memcpy(page, xlogreader->readBuf, len);
-		memset(page + len, 0, XLOG_BLCKSZ - len);
-
-		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
-		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
-	}
-	else
-	{
-		/*
-		 * There is no partial block to copy. Just set InitializedUpTo, and
-		 * let the first attempt to insert a log record to initialize the next
-		 * buffer.
-		 */
-		XLogCtl->InitializedUpTo = EndOfLog;
-	}
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
-- 
2.20.1

0003-Lazy-unmap-WAL-segments.patch (application/octet-stream)
From 29b1954f4bba9ffd7e28fac1c8c4302dfe4bc2a6 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:14 +0900
Subject: [msync 3/5] Lazy-unmap WAL segments

---
 src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 43f9a8affc..317816a0b9 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -768,7 +768,9 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static XLogSegNo beingClosedLogSegNo = 0;
 static char *mappedPages = NULL;
+static char *beingUnmappedPages = NULL;
 static bool pmemMapped = 0;
 
 /*
@@ -1162,6 +1164,14 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/* Lazy-unmap */
+	if (beingUnmappedPages != NULL)
+	{
+		XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+		beingUnmappedPages = NULL;
+		beingClosedLogSegNo = 0;
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -1794,9 +1804,23 @@ GetXLogBuffer(XLogRecPtr ptr)
 	XLByteToSeg(ptr, segno, wal_segment_size);
 	if (segno != openLogSegNo)
 	{
-		/* Unmap the current segment if mapped */
+		/*
+		 * We do not want to unmap the current segment here because we are in
+		 * a critical section and unmap is a time-consuming operation.  So we
+		 * just mark it to be unmapped later.
+		 */
 		if (mappedPages != NULL)
-			XLogFileUnmap(mappedPages, openLogSegNo);
+		{
+			/*
+			 * If there is another being-unmapped segment, it cannot be helped;
+			 * we unmap it here.
+			 */
+			if (beingUnmappedPages != NULL)
+				XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+
+			beingUnmappedPages = mappedPages;
+			beingClosedLogSegNo = openLogSegNo;
+		}
 
 		/* Map the segment we need */
 		mappedPages = XLogFileMap(segno, &pmemMapped);
-- 
2.20.1
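
The lazy-unmap idea in patch 0003 above can be modelled without any real mappings. The following is a minimal sketch, not the patch's code: the SegMapState struct and the function names are hypothetical stand-ins for the file-scope variables (mappedPages, openLogSegNo, beingUnmappedPages, beingClosedLogSegNo), and mapping and unmapping are simulated with flags and a counter.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for the patch's file-scope variables.
 * Mappings are simulated by flags; unmap_calls counts simulated unmaps.
 */
typedef struct SegMapState
{
	int			has_open;		/* is a segment currently mapped? */
	uint64_t	open_segno;		/* which segment is mapped */
	int			has_pending;	/* is an unmap deferred? */
	uint64_t	pending_segno;	/* which segment awaits unmapping */
	int			unmap_calls;	/* number of simulated unmaps so far */
} SegMapState;

/* Simulated unmap: the expensive step we want off the critical path. */
static void
unmap_pending(SegMapState *st)
{
	assert(st->has_pending);
	st->has_pending = 0;
	st->unmap_calls++;
}

/*
 * Called where the insert position crosses into another segment, that is,
 * inside the critical section.  The old mapping is only marked for a later
 * unmap, unless an earlier one is still pending and must be released first.
 */
static void
switch_segment(SegMapState *st, uint64_t segno)
{
	if (st->has_open && st->open_segno == segno)
		return;					/* already mapped */

	if (st->has_open)
	{
		if (st->has_pending)
			unmap_pending(st);	/* cannot be helped; unmap now */
		st->has_pending = 1;
		st->pending_segno = st->open_segno;
	}
	st->has_open = 1;			/* "map" the new segment */
	st->open_segno = segno;
}

/* Called after the critical section: perform the deferred unmap, if any. */
static void
finish_lazy_unmap(SegMapState *st)
{
	if (st->has_pending)
		unmap_pending(st);
}
```

In this model an unmap is paid inside switch_segment() only when two segment switches happen without an intervening finish_lazy_unmap(), which corresponds to the "it cannot be helped" branch in the patch.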

0004-Speculative-map-WAL-segments.patch (application/octet-stream)
From a1e54ccba738cc339647bba0bafd7df7e92915c3 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:15 +0900
Subject: [msync 4/5] Speculative-map WAL segments

---
 src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 317816a0b9..9b3caa63a4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -976,6 +976,8 @@ XLogInsertRecord(XLogRecData *rdata,
 							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
+	XLogRecPtr	ProbablyInsertPos;
+	XLogSegNo	ProbablyInsertSegNo;
 	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
@@ -985,6 +987,23 @@ XLogInsertRecord(XLogRecData *rdata,
 	if (!XLogInsertAllowed())
 		elog(ERROR, "cannot make new WAL entries during recovery");
 
+	/* Speculatively map a segment we probably need */
+	ProbablyInsertPos = GetInsertRecPtr();
+	XLByteToSeg(ProbablyInsertPos, ProbablyInsertSegNo, wal_segment_size);
+	if (ProbablyInsertSegNo != openLogSegNo)
+	{
+		if (mappedPages != NULL)
+		{
+			Assert(beingUnmappedPages == NULL);
+			Assert(beingClosedLogSegNo == 0);
+			beingUnmappedPages = mappedPages;
+			beingClosedLogSegNo = openLogSegNo;
+		}
+		mappedPages = XLogFileMap(ProbablyInsertSegNo, &pmemMapped);
+		Assert(mappedPages != NULL);
+		openLogSegNo = ProbablyInsertSegNo;
+	}
+
 	/*----------
 	 *
 	 * We have now done all the preparatory work we can without holding a
-- 
2.20.1

0005-Allocate-WAL-segments-to-utilize-hugepage.patch (application/octet-stream)
From 0e34f41ac611cf0f4e5bdc3428b71e0f81d33cb0 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 10 Feb 2020 17:53:16 +0900
Subject: [msync 5/5] Allocate WAL segments to utilize hugepage

See also https://nvdimm.wiki.kernel.org/2mib_fs_dax
---
 src/backend/access/transam/xlog.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9b3caa63a4..d3ef7bf6e5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2904,8 +2904,21 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-	save_errno = 0;
-	if (wal_init_zero)
+
+	/*
+	 * Allocate the file by posix_fallocate(3) to utilize hugepage and reduce
+	 * overhead of page fault.  Note that posix_fallocate(3) does not set errno
+	 * on error.  Instead, it returns an error number directly.
+	 */
+	save_errno = posix_fallocate(fd, 0, wal_segment_size);
+
+	if (save_errno)
+	{
+		/*
+		 * Do nothing on error.  Go to pgstat_report_wait_end().
+		 */
+	}
+	else if (wal_init_zero)
 	{
 		XLogCtlInsert  *Insert = &XLogCtl->Insert;
 		XLogPageHeader	NewPage = (XLogPageHeader) zbuffer.data;
-- 
2.20.1
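
Patch 0005 above depends on the fact that posix_fallocate(3) reports failure through its return value rather than through errno. The sketch below illustrates that calling convention in isolation. The helper name preallocate_segment is hypothetical, and copying the error number into errno is an illustrative choice (so that "%m"-style reporting would work), not something the patch does.

```c
/* Request the POSIX.1-2001 interfaces if no feature macro is set yet. */
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * Preallocate `size` bytes starting at offset 0 of `fd`.
 *
 * Returns 0 on success or a positive error number on failure.  Because
 * posix_fallocate() itself leaves errno untouched, the error number is
 * also stored into errno here for callers that report errors via errno.
 */
static int
preallocate_segment(int fd, off_t size)
{
	int			rc;

	assert(size > 0);

	rc = posix_fallocate(fd, 0, size);
	if (rc != 0)
		errno = rc;

	return rc;
}
```

A caller would treat any nonzero return as the error code itself, which is why the patch assigns the result directly to save_errno.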

#11Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#1)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench. Roughly speaking, when I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0. VTune told me that the CPU time of memcpy() called by CopyXLogRecordToWAL() was larger than before. When I used *NVDIMM-N and ext4 with filesystem DAX* to store WAL, however, it achieved "not bad" performance compared with our previous patchset, the non-volatile WAL buffer. The CPU time of both XLogInsert() and XLogFlush() was reduced, as it was with the non-volatile WAL buffer.

So I think mmap()-ing WAL segments as WAL buffers is not such a bad idea as long as we use PMEM, at least NVDIMM-N.

For now I would rather not give concrete performance numbers, because the mmap()-ing patchset is WIP and there might be bugs that wrongly "improve" or "degrade" performance. Also, explaining why the performance improved requires background on persistent memory programming and related features such as filesystem DAX, huge page faults, and WAL persistence with cache flush and memory barrier instructions. I will cover all the details at an appropriate time and place (at the conference, or here later).
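
For readers unfamiliar with the approach, here is a minimal sketch of what "mmap()-ing a segment file and using it as the buffer" means, written with plain POSIX calls only. It is not the patchset's code: the patches go through PMDK's pmem_map_file(), while the helper names map_segment and write_record and the choice of msync(MS_SYNC) for durability are assumptions made for this illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Map an existing file of seg_size bytes read-write and shared, so that
 * stores to the returned address reach the file.  Returns NULL on failure.
 */
static char *
map_segment(const char *path, size_t seg_size)
{
	int			fd = open(path, O_RDWR);
	void	   *addr;

	if (fd < 0)
		return NULL;

	addr = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close */

	return (addr == MAP_FAILED) ? NULL : (char *) addr;
}

/*
 * Copy a record into the mapped segment at `off` and flush it.  msync()
 * needs a page-aligned start address, so the range is widened downwards
 * to the enclosing page boundary.  Returns 0 on success, -1 on failure.
 */
static int
write_record(char *seg, size_t seg_size, size_t off,
			 const void *rec, size_t len)
{
	size_t		pagesz = (size_t) sysconf(_SC_PAGESIZE);
	size_t		start;

	assert(off <= seg_size && len <= seg_size - off);

	memcpy(seg + off, rec, len);

	start = off - (off % pagesz);
	return msync(seg + start, (off - start) + len, MS_SYNC);
}
```

The record is copied exactly once, straight into the file mapping, and there is no separate write() step. On real PMEM with DAX, the msync() call would typically be replaced by user-space cache flushes, which is the part this sketch leaves out.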

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Monday, February 10, 2020 6:30 PM
To: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>
Cc: 'pgsql-hackers@postgresql.org' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I made another WIP patchset to mmap WAL segments as WAL buffers. Note that this is not a non-volatile WAL
buffer patchset but its competitor. I am measuring and analyzing the performance of this patchset to compare
with my N.V.WAL buffer.

Please wait for a several more days for the result report...

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> NTT Software Innovation Center

-----Original Message-----
From: Robert Haas <robertmhaas@gmail.com>
Sent: Wednesday, January 29, 2020 6:00 AM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Heikki Linnakangas <hlinnaka@iki.fi>; pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

On Tue, Jan 28, 2020 at 3:28 AM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

I think our concerns are roughly classified into two:

(1) Performance
(2) Consistency

And your "different concern" is rather into (2), I think.

Actually, I think it was mostly a performance concern (writes
triggering lots of reading) but there might be a consistency issue as well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL
Company

#12Amit Langote
amitlangote09@gmail.com
In reply to: Takashi Menjo (#11)
Re: [PoC] Non-volatile WAL buffer

Menjo-san,

On Mon, Feb 17, 2020 at 1:13 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

> I applied my patchset that mmap()-s WAL segments as WAL buffers to refs/tags/REL_12_0, and measured and analyzed its performance with pgbench. Roughly speaking, When I used *SSD and ext4* to store WAL, it was "obviously worse" than the original REL_12_0.

I apologize for not having any opinion on the patches themselves, but
let me point out that it's better to base these patches on HEAD
(master branch) than REL_12_0, because all new code is committed to
the master branch, whereas stable branches such as REL_12_0 only
receive bug fixes. Do you have any specific reason to be working on
REL_12_0?

Thanks,
Amit

#13Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Amit Langote (#12)
RE: [PoC] Non-volatile WAL buffer

Hello Amit,

> I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these
> patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch,
> whereas stable branches such as REL_12_0 only receive bug fixes. Do you have any specific reason to be working
> on REL_12_0?

Yes, because I think a release tag makes performance measurements easier for people to reproduce and discuss. Of course I know that all newly accepted patches are merged into master's HEAD, not into stable branches or release tags, so I am aware that I will have to rebase my patchset onto master sooner or later. However, if someone, including me, says that they applied my patchset to "master" and measured its performance, we have to pay attention to which commit "master" really pointed to. Although we have SHA-1 hashes to identify a commit, we would still need to check whether that specific commit on master includes patches affecting performance, because master's HEAD gets new patches day by day. A release tag, on the other hand, clearly points to a commit that we all know. We can also check its features and improvements more easily by using the release notes and user manuals.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#14Amit Langote
amitlangote09@gmail.com
In reply to: Takashi Menjo (#13)
Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp> wrote:

> Hello Amit,
>
>> I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these
>> patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch,
>> whereas stable branches such as REL_12_0 only receive bug fixes. Do you have any specific reason to be working
>> on REL_12_0?
>
> Yes, because I think it's human-friendly to reproduce and discuss performance measurement. Of course I know all new accepted patches are merged into master's HEAD, not stable branches and not even release tags, so I'm aware of rebasing my patchset onto master sooner or later. However, if someone, including me, says that s/he applies my patchset to "master" and measures its performance, we have to pay attention to which commit the "master" really points to. Although we have sha1 hashes to specify which commit, we should check whether the specific commit on master has patches affecting performance or not because master's HEAD gets new patches day by day. On the other hand, a release tag clearly points the commit all we probably know. Also we can check more easily the features and improvements by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least two
numbers -- performance with a branch's HEAD without the patch applied
and that with the patch applied -- which can be enough in most cases to
see the difference the patch makes. Sure, the numbers might change on
each report, but that's fine I'd think. If you continue to develop
against the stable branch, you might fail to notice the impact of any
relevant developments in the master branch, even developments which
possibly require rethinking the architecture of your own changes,
although maybe that rarely occurs.

Thanks,
Amit

#15Andres Freund
andres@anarazel.de
In reply to: Takashi Menjo (#11)
Re: [PoC] Non-volatile WAL buffer

Hi,

On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:

> I applied my patchset that mmap()-s WAL segments as WAL buffers to
> refs/tags/REL_12_0, and measured and analyzed its performance with
> pgbench. Roughly speaking, When I used *SSD and ext4* to store WAL,
> it was "obviously worse" than the original REL_12_0. VTune told me
> that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
> larger than before.

FWIW, this might largely be because of page faults. In contrast to
before we wouldn't reuse the same pages (because they've been
munmap()/mmap()ed), so the first time they're touched, we'll incur page
faults. Did you try mmap()ing with MAP_POPULATE? It's probably also
worthwhile to try to use MAP_HUGETLB.

Still doubtful it's the right direction, but I'd rather have good
numbers to back me up :)
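
The MAP_POPULATE suggestion above can be tried with a few lines of C. This is a minimal sketch under stated assumptions: it is Linux-specific, the helper name map_prefaulted is hypothetical, and on platforms without MAP_POPULATE it silently falls back to a normal (lazily faulted) mapping. MAP_HUGETLB is not shown because it applies to anonymous or hugetlbfs-backed mappings rather than ordinary files.

```c
/* MAP_POPULATE is a Linux extension; ask glibc to expose it. */
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fallback so the sketch still compiles where MAP_POPULATE is unavailable. */
#ifndef MAP_POPULATE
#define MAP_POPULATE 0
#endif

/*
 * Map `len` bytes of an existing file read-write and shared, asking the
 * kernel to prefault the page tables at mmap() time instead of on first
 * touch.  Returns NULL on failure.
 */
static void *
map_prefaulted(const char *path, size_t len)
{
	int			fd;
	void	   *addr;

	assert(len > 0);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return NULL;

	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED | MAP_POPULATE, fd, 0);
	close(fd);					/* the mapping stays valid after close */

	return (addr == MAP_FAILED) ? NULL : addr;
}
```

With prefaulting, the page-fault cost is paid once when the segment is mapped rather than inside memcpy() on the insert path, which is where the VTune profile quoted above showed the extra CPU time.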

Greetings,

Andres Freund

#16Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Amit Langote (#14)
RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

#17Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#1)
6 attachment(s)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.

I also measured performance before and after applying the patchset, varying the -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details are given below.

Results (s=50)
==============
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)

Results (s=1000)
================
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)

Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).

The percentage improvement in the s=1000 case looks larger than in the s=50 case. I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
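
As a sanity check on the tables, throughput and average latency in a closed-loop benchmark such as pgbench are tied together: with c clients each waiting for its own transaction, the average latency is roughly c divided by the throughput. For example, 8 clients at 35.7 x 10^3 TPS gives about 0.224 ms, matching the first row. The sketch below encodes that relation and the relative-change formula. The function names are hypothetical, and small deviations are expected because the tabulated values are rounded.

```c
#include <assert.h>

/*
 * Expected average latency in milliseconds for `clients` concurrent
 * connections sustaining `tps` transactions per second (closed loop).
 */
static double
expected_latency_ms(int clients, double tps)
{
	assert(clients > 0 && tps > 0.0);
	return (double) clients / tps * 1000.0;
}

/* Relative change from `before` to `after`, in percent. */
static double
relative_change_pct(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}
```

Calling relative_change_pct(35.7, 37.1) reproduces the "+3.9%" entry to within rounding, and the same function applied to the latency columns yields the negative percentages.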

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
  - Pin postgres (server processes) to node 0 and pgbench to node 1
  - 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
  - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
  - Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I ran the following steps three times and took the median of the three runs as the final result shown in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Run the pg_prewarm extension on all four pgbench tables
(6) Run pgbench for 30 minutes
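
The "median of three runs" reduction used for each (c,j) pair can be sketched as follows. The function name is hypothetical and the sketch only shows the selection step, not how the per-run numbers are collected from pgbench.

```c
#include <assert.h>

/* Return the median of three measurements by sorting them in place. */
static double
median_of_three(double a, double b, double c)
{
	double		tmp;

	assert(a == a && b == b && c == c);	/* reject NaN inputs */

	if (a > b)
	{
		tmp = a;
		a = b;
		b = tmp;
	}
	if (b > c)
	{
		tmp = b;
		b = c;
		c = tmp;
	}
	if (a > b)
	{
		tmp = a;
		a = b;
		b = tmp;
	}

	return b;
}
```

Taking the median instead of the mean keeps one unusually slow or fast run from shifting the reported value.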

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench used its default built-in "TPC-B (sort of)" script.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

Show quoted text

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'PostgreSQL-development'
<pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches themselves,
but let me point out that it's better to base these patches on
HEAD (master branch) than REL_12_0, because all new code is
committed to the master branch, whereas stable branches such as
REL_12_0 only receive bug fixes. Do you have any

specific reason to be working on REL_12_0?

Yes, because I think it makes performance measurements easier for
people to reproduce and discuss. Of course I know that all newly
accepted patches are merged into master's HEAD, not into stable
branches and not even into release tags, so I'm aware that I will
have to rebase my patchset onto master sooner or later. However, if
someone, including me, says that s/he applied my patchset to "master"
and measured its performance, we have to pay attention to which commit
"master" really points to. Although we have SHA-1 hashes to specify a
commit, we would still have to check whether that specific commit on
master includes patches affecting performance, because master's HEAD
gets new patches day by day. On the other hand, a release tag clearly
points to a commit that we all probably know. We can also check its
features and improvements more easily by using the release notes and
user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least two
numbers -- performance with a branch's HEAD without patch applied and
that with patch applied -- which can be enough in most cases to see
the difference the patch makes. Sure, the numbers might change on
each report, but that's fine I'd think. If you continue to develop
against the stable branch, you might fail to notice the impact of any
relevant developments in the master branch, even developments which
possibly require rethinking the architecture of your own changes,
although maybe that rarely occurs.

Thanks,
Amit

Attachments:

v2-0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From db976d96affc0b120c79f6ac666fc4fc663b13d2 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:10:41 +0900
Subject: [PATCH v2 1/3] Support GUCs for external WAL buffer

To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size.  Now postgres maps a file at that path onto memory to
use it as WAL buffer.  Note that the buffer is still volatile for now.
---
 configure                                     | 262 ++++++++++++++++++
 configure.in                                  |  43 +++
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/nv_xlog_buffer.c   |  95 +++++++
 src/backend/access/transam/xlog.c             | 164 ++++++++++-
 src/backend/utils/misc/guc.c                  |  23 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/bin/initdb/initdb.c                       |  95 ++++++-
 src/include/access/nv_xlog_buffer.h           |  71 +++++
 src/include/access/xlog.h                     |   2 +
 src/include/pg_config.h.in                    |   6 +
 src/include/utils/guc.h                       |   4 +
 12 files changed, 748 insertions(+), 22 deletions(-)
 create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
 create mode 100644 src/include/access/nv_xlog_buffer.h

diff --git a/configure b/configure
index 93ee4a2937..72ebaa525d 100755
--- a/configure
+++ b/configure
@@ -864,6 +864,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 '
@@ -1566,6 +1567,7 @@ Optional Packages:
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
+  --with-nvwal            use non-volatile WAL buffer (NVWAL)
   --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
 
 Some influential environment variables:
@@ -8307,6 +8309,203 @@ fi
 
 
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -z "$GREP"; then
+  ac_path_GREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in grep ggrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+  # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'GREP' >> "conftest.nl"
+    "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_GREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_GREP="$ac_path_GREP"
+      ac_path_GREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_GREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_GREP"; then
+    as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+   then ac_cv_path_EGREP="$GREP -E"
+   else
+     if test -z "$EGREP"; then
+  ac_path_EGREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in egrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+  # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'EGREP' >> "conftest.nl"
+    "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_EGREP="$ac_path_EGREP"
+      ac_path_EGREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_EGREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_EGREP"; then
+    as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_EGREP=$EGREP
+fi
+
+   fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#if __ELF__
+  yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "yes" >/dev/null 2>&1; then :
+  ELF_SYS=true
+else
+  if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
 #
 # Assignments
 #
@@ -12664,6 +12863,57 @@ fi
 fi
 
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
 
 ##
 ## Header files
@@ -13343,6 +13593,18 @@ fi
 
 done
 
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$PORTNAME" = "win32" ; then
diff --git a/configure.in b/configure.in
index e2ae4e2d3e..4b3f1b4c42 100644
--- a/configure.in
+++ b/configure.in
@@ -968,6 +968,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
               [do not use Zlib])
 AC_SUBST(with_zlib)
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+              [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+  yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
 #
 # Assignments
 #
@@ -1269,6 +1301,12 @@ elif test "$with_uuid" = ossp ; then
 fi
 AC_SUBST(UUID_LIBS)
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [],
+               [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 
 ##
 ## Header files
@@ -1446,6 +1484,11 @@ elif test "$with_uuid" = ossp ; then
       [AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
 fi
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 if test "$PORTNAME" = "win32" ; then
    AC_CHECK_HEADERS(crtdefs.h)
 fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
 	xlogfuncs.o \
 	xloginsert.o \
 	xlogreader.o \
-	xlogutils.o
+	xlogutils.o \
+	nv_xlog_buffer.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ *		PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns a mapped address if success; PANICs and never return otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
+
+	Assert(fname != NULL);
+	Assert(fsize > 0);
+
+	if (IsBootstrapProcessingMode())
+	{
+		/*
+		 * Create and map a new file if we are in bootstrap mode (typically
+		 * executed by initdb).
+		 */
+		addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+							 pg_file_create_mode, &map_len, &is_pmem);
+	}
+	else
+	{
+		/*
+		 * Map an existing file.  The second argument (len) should be zero,
+		 * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+		 * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+		 */
+		addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+	}
+
+	if (addr == NULL)
+		elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+	if (map_len != fsize)
+		elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+					"expected %zu; actual %zu",
+			 fname, fsize, map_len);
+
+	if (!is_pmem)
+		elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+			 fname);
+
+	/*
+	 * Assert page boundary alignment (8KiB as default).  It should pass because
+	 * PMDK considers hugepage boundary alignment (2MiB or 1GiB on x64).
+	 */
+	Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+	elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+		 fname, addr, (char *) addr + map_len);
+	return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	Assert(addr != NULL);
+
+	if (pmem_unmap(addr, fsize) < 0)
+	{
+		elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+		return;
+	}
+
+	elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4361568882..24aed4e76e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -36,6 +36,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -852,6 +853,12 @@ static bool InRedo = false;
 /* Have we launched bgwriter during recovery? */
 static bool bgwriterLaunched = false;
 
+/* For non-volatile WAL buffer (NVWAL) */
+char	   *NvwalPath = NULL;	/* a GUC parameter */
+int			NvwalSizeMB = 1024;	/* a direct GUC parameter */
+static Size	NvwalSize = 0;		/* an indirect GUC parameter */
+static bool	NvwalAvail = false;
+
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -4947,6 +4954,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+	Assert(!NvwalAvail);
+
+	if (**newval != '\0')
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path is invalid parameter without NVWAL");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+	/* true if not empty; false if empty */
+	NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the boundary only and DOES NOT check if the size is multiple
+ * of wal_segment_size because the segment size (probably stored in the
+ * control file) have not been set properly here yet.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+	Size		buf_size;
+	int64		npages;
+
+	Assert(*newval > 0);
+
+	buf_size = (Size) (*newval) * 1024 * 1024;
+	npages = (int64) buf_size / XLOG_BLCKSZ;
+	Assert(npages > 0);
+
+	if (npages > INT_MAX)
+	{
+		/* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages too large; "
+						 "buf_size %zu; XLOG_BLCKSZ %d",
+						 *newval, buf_size, (int) XLOG_BLCKSZ);
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+	NvwalSize = (Size) newval * 1024 * 1024;
+}
+
 /*
  * Read the control file, set respective GUCs.
  *
@@ -4975,13 +5052,49 @@ XLOGShmemSize(void)
 {
 	Size		size;
 
+	/*
+	 * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+	 * Instead, we set it the value based on the size of the file for the
+	 * buffer. This should be done here because of xlblocks array calculation.
+	 */
+	if (NvwalAvail)
+	{
+		char		buf[32];
+		int64		npages;
+
+		Assert(NvwalSizeMB > 0);
+		Assert(NvwalSize > 0);
+		Assert(wal_segment_size > 0);
+		Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+		/*
+		 * At last, we can check if the size of non-volatile WAL buffer
+		 * (nvwal_size) is multiple of WAL segment size.
+		 *
+		 * Note that NvwalSize has already been calculated in assign_nvwal_size.
+		 */
+		if (NvwalSize % wal_segment_size != 0)
+		{
+			elog(PANIC,
+				 "invalid value for nvwal_size (%dMB): "
+				 "it should be multiple of WAL segment size; "
+				 "NvwalSize %zu; wal_segment_size %d",
+				 NvwalSizeMB, NvwalSize, wal_segment_size);
+		}
+
+		npages = (int64) NvwalSize / XLOG_BLCKSZ;
+		Assert(npages > 0 && npages <= INT_MAX);
+
+		snprintf(buf, sizeof(buf), "%d", (int) npages);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+	}
 	/*
 	 * If the value of wal_buffers is -1, use the preferred auto-tune value.
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
 	 */
-	if (XLOGbuffers == -1)
+	else if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
@@ -4997,10 +5110,13 @@ XLOGShmemSize(void)
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	if (!NvwalAvail)
+	{
+		/* extra alignment padding for XLOG I/O buffers */
+		size = add_size(size, XLOG_BLCKSZ);
+		/* and the buffers themselves */
+		size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	}
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5097,13 +5213,32 @@ XLOGShmemInit(void)
 	}
 
 	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * Open and memory-map a file for non-volatile XLOG buffer. The PMDK will
+	 * align the start of the buffer to 2-MiB boundary if the size of the
+	 * buffer is larger than or equal to 4 MiB.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	if (NvwalAvail)
+	{
+		/* Logging and error-handling should be done in the function */
+		XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+		/*
+		 * Do not memset non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it would contain records for recovery. We should do so in
+		 * checkpoint after the recovery completes successfully.
+		 */
+	}
+	else
+	{
+		/*
+		 * Align the start of the page buffers to a full xlog block size
+		 * boundary. This simplifies some calculations in XLOG insertion. It
+		 * is also required for O_DIRECT.
+		 */
+		allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+		XLogCtl->pages = allocptr;
+		memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	}
 
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8400,6 +8535,13 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+
+	/*
+	 * If we use non-volatile XLOG buffer, unmap it.
+	 */
+	if (NvwalAvail)
+		UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 464f264d9a..4befd4d276 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2703,7 +2703,7 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-		-1, -1, (INT_MAX / XLOG_BLCKSZ),
+		-1, -1, INT_MAX,
 		check_wal_buffers, NULL, NULL
 	},
 
@@ -3304,6 +3304,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, assign_tcp_user_timeout, show_tcp_user_timeout
 	},
 
+	{
+		{"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+			NULL,
+			GUC_UNIT_MB
+		},
+		&NvwalSizeMB,
+		1024, 1, INT_MAX,
+		check_nvwal_size, assign_nvwal_size, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4330,6 +4341,16 @@ static struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+			NULL
+		},
+		&NvwalPath,
+		"",
+		check_nvwal_path, assign_nvwal_path, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index e58e4788a8..0c23c4d26b 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -224,6 +224,8 @@
 #checkpoint_timeout = 5min		# range 30s-1d
 #max_wal_size = 1GB
 #min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index a6577486ce..869f95915e 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,7 +144,10 @@ static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
 static int	wal_segment_size_mb;
+static int	nvwal_size_mb;
 
 
 /* internal vars */
@@ -1103,14 +1106,78 @@ setup_config(void)
 	conflines = replace_token(conflines, "#port = 5432", repltok);
 #endif
 
-	/* set default max_wal_size and min_wal_size */
-	snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
-	conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
-
-	snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
-	conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	if (nvwal_path != NULL)
+	{
+		int nr_segs;
+
+		if (str_nvwal_size_mb == NULL)
+			nvwal_size_mb = 1024;
+		else
+		{
+			char *endptr;
+
+			/* check that the argument is a number */
+			nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+			/* verify that the size of non-volatile WAL buffer is valid */
+			if (endptr == str_nvwal_size_mb || *endptr != '\0')
+			{
+				pg_log_error("argument of --nvwal-size must be a number; "
+							 "str_nvwal_size_mb '%s'",
+							 str_nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb <= 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a positive number; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb % wal_segment_size_mb != 0)
+			{
+				pg_log_error("argument of --nvwal-size must be multiple of WAL segment size; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+				exit(1);
+			}
+		}
+
+		/*
+		 * XXX We set {min_,max_,nv}wal_size to the same value.  Note that
+		 * postgres might bootstrap and run if the three config does not have
+		 * the same value, but have not been tested yet.
+		 */
+		nr_segs = nvwal_size_mb / wal_segment_size_mb;
+
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+				 nvwal_path);
+		conflines = replace_token(conflines,
+								  "#nvwal_path = '/path/to/nvwal'", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+	}
+	else
+	{
+		/* set default max_wal_size and min_wal_size */
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	}
 
 	snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
 			 escape_quotes(lc_messages));
@@ -2309,6 +2376,8 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("  -P, --nvwal-path=FILE     path to file for non-volatile WAL buffer (NVWAL)\n"));
+	printf(_("  -Q, --nvwal-size=SIZE     size of NVWAL, in megabytes\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("  -k, --data-checksums      use data page checksums\n"));
@@ -2982,6 +3051,8 @@ main(int argc, char *argv[])
 		{"sync-only", no_argument, NULL, 'S'},
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
+		{"nvwal-path", required_argument, NULL, 'P'},
+		{"nvwal-size", required_argument, NULL, 'Q'},
 		{"data-checksums", no_argument, NULL, 'k'},
 		{"allow-group-access", no_argument, NULL, 'g'},
 		{NULL, 0, NULL, 0}
@@ -3025,7 +3096,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
 	{
 		switch (c)
 		{
@@ -3119,6 +3190,12 @@ main(int argc, char *argv[])
 			case 12:
 				str_wal_segment_size_mb = pg_strdup(optarg);
 				break;
+			case 'P':
+				nvwal_path = pg_strdup(optarg);
+				break;
+			case 'Q':
+				str_nvwal_size_mb = pg_strdup(optarg);
+				break;
 			case 'g':
 				SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
 				break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void	UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+#define nv_memset_persist	pmem_memset_persist
+#define nv_memcpy_nodrain	pmem_memcpy_nodrain
+#define nv_flush			pmem_flush
+#define nv_drain			pmem_drain
+#define nv_persist			pmem_persist
+
+#else
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	return;
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+	return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+				  size_t len)
+{
+	return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+	return;
+}
+
+static inline void
+nv_drain(void)
+{
+	return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+	return;
+}
+
+#endif							/* USE_NVWAL */
+#endif							/* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 98b033fc20..174423901a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -129,6 +129,8 @@ extern int	recoveryTargetAction;
 extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
+extern char *NvwalPath;
+extern int  NvwalSizeMB;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 4fa0f770aa..1b6fb49f76 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
 /* Define to 1 if you have the `pam' library (-lpam). */
 #undef HAVE_LIBPAM
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define if you have a function readline library */
 #undef HAVE_LIBREADLINE
 
@@ -871,6 +874,9 @@
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
 /* Define to build with OpenSSL support. (--with-openssl) */
 #undef USE_OPENSSL
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index ce93ace76c..d4a345c7f0 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -437,6 +437,10 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.17.1
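
A side note on the wrapper header in the patch above, offered as a sketch and not as a change to the patch. The header maps nv_* names to libpmem when USE_NVWAL is defined and to no-op stubs otherwise. The same pattern can be written so that the fallback still behaves like an ordinary DRAM copy. Everything prefixed demo_ and the DEMO_HAVE_LIBPMEM switch are hypothetical names for illustration; only the pmem_* functions are real libpmem calls. Unlike the stubs in the patch, these fallbacks really copy data, which is an assumption made here so the sketch can be exercised without PMEM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#ifdef DEMO_HAVE_LIBPMEM		/* hypothetical switch, not from the patch */
#include <libpmem.h>

/* With libpmem available, forward directly to the library. */
#define demo_nv_memcpy_nodrain	pmem_memcpy_nodrain
#define demo_nv_memset_persist	pmem_memset_persist
#define demo_nv_drain			pmem_drain

#else

/* Fallback: plain DRAM copy, with no durability guarantee at all. */
static inline void *
demo_nv_memcpy_nodrain(void *dest, const void *src, size_t len)
{
	assert(dest != NULL && src != NULL);
	return memcpy(dest, src, len);
}

/* Fallback: plain DRAM fill, with no durability guarantee at all. */
static inline void *
demo_nv_memset_persist(void *dest, int c, size_t len)
{
	assert(dest != NULL);
	return memset(dest, c, len);
}

/* Fallback: nothing to wait for on volatile memory. */
static inline void
demo_nv_drain(void)
{
}

#endif							/* DEMO_HAVE_LIBPMEM */
```

Keeping the fallbacks static inline means the header can be included from several source files without producing duplicate external definitions.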

v2-0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 39d2f4e1b11eef84e1f1be8e8ff4f2f22ba85a37 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:10:42 +0900
Subject: [PATCH v2 2/3] Non-volatile WAL buffer

The external WAL buffer now becomes non-volatile.

Bumps PG_CONTROL_VERSION.
---
 src/backend/access/transam/xlog.c       | 1033 ++++++++++++++++++++---
 src/backend/access/transam/xlogreader.c |   24 +
 src/bin/pg_controldata/pg_controldata.c |    3 +
 src/include/access/xlog.h               |    8 +
 src/include/catalog/pg_control.h        |   17 +-
 5 files changed, 973 insertions(+), 112 deletions(-)
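
A note for reviewers on the XLogFlush() hunk in the diff that follows: it treats the NVWAL as a ring, so the byte range between the last persistent LSN and the requested LSN may wrap past the end of the buffer and need two flush calls. The helper below is a minimal standalone sketch of that range computation, not code from the patch. FlushRange and ring_flush_ranges are hypothetical names, and the sketch returns the ranges instead of calling nv_flush(). It also treats a span of exactly ring_size as "flush everything", a small simplification compared with the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous byte range inside the ring (hypothetical type). */
typedef struct FlushRange
{
	size_t		off;			/* offset from the start of the ring */
	size_t		len;			/* number of bytes to flush */
} FlushRange;

/*
 * Compute the ranges covering LSNs [from_lsn, upto_lsn) in a ring of
 * ring_size bytes.  Returns the number of ranges stored in out[] (0..2).
 */
static int
ring_flush_ranges(uint64_t from_lsn, uint64_t upto_lsn, size_t ring_size,
				  FlushRange out[2])
{
	size_t		fromoff;
	size_t		uptooff;

	assert(ring_size > 0);

	/* Nothing new to flush. */
	if (upto_lsn <= from_lsn)
		return 0;

	/* The span covers the whole ring: flush all of it once. */
	if (upto_lsn - from_lsn >= ring_size)
	{
		out[0] = (FlushRange) {0, ring_size};
		return 1;
	}

	fromoff = (size_t) (from_lsn % ring_size);
	uptooff = (size_t) (upto_lsn % ring_size);

	/* The span wraps: flush the tail first, then the head (if any). */
	if (uptooff < fromoff)
	{
		out[0] = (FlushRange) {fromoff, ring_size - fromoff};
		if (uptooff == 0)
			return 1;
		out[1] = (FlushRange) {0, uptooff};
		return 2;
	}

	/* No wrap: a single contiguous range. */
	out[0] = (FlushRange) {fromoff, uptooff - fromoff};
	return 1;
}
```

For example, with a 1024-byte ring, flushing from LSN 1000 up to LSN 1100 yields the tail range (1000, 24) followed by the head range (0, 76).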

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 24aed4e76e..2c6861f77e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -643,6 +643,13 @@ typedef struct XLogCtlData
 	TimeLineID	ThisTimeLineID;
 	TimeLineID	PrevTimeLineID;
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * All the records up to this LSN are persistent in NVWAL.
+	 */
+	XLogRecPtr	persistentUpTo;
+
 	/*
 	 * SharedRecoveryInProgress indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
@@ -766,11 +773,12 @@ typedef enum
 	XLOG_FROM_ANY = 0,			/* request to read WAL from any source */
 	XLOG_FROM_ARCHIVE,			/* restored using restore_command */
 	XLOG_FROM_PG_WAL,			/* existing file in pg_wal */
+	XLOG_FROM_NVWAL,			/* non-volatile WAL buffer */
 	XLOG_FROM_STREAM			/* streamed from master */
 } XLogSource;
 
 /* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
@@ -901,6 +909,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1181,6 +1190,43 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/*
+	 * Request a checkpoint here if non-volatile WAL buffer is used and we
+	 * have consumed too much WAL since the last checkpoint.
+	 *
+	 * We first screen under the condition (1) OR (2) below:
+	 *
+	 * (1) The record was the first one in a certain segment.
+	 * (2) The record was inserted across segments.
+	 *
+	 * We then check the segment number which the record was inserted into.
+	 */
+	if (NvwalAvail && inserted &&
+		(StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+		 StartPos / wal_segment_size < EndPos / wal_segment_size))
+	{
+		XLogSegNo	end_segno;
+
+		XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+		/*
+		 * NOTE: We do not signal walsender here because the inserted record
+		 * has not been drained to the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal walarchiver here because the inserted record
+		 * has not been flushed to a segment file.  So we don't need to update
+		 * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+		 */
+
+		/* Two-step checking for speed (see also XLogWrite) */
+		if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+		{
+			(void) GetRedoRecPtr();
+			if (XLogCheckpointNeeded(end_segno))
+				RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+		}
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -2105,6 +2151,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
 	int			npages = 0;
+	bool		is_firstpage;
+
+	if (NvwalAvail)
+		elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo,
+			 (uint32) (upto >> 32),
+			 (uint32) upto,
+			 opportunistic ? "true" : "false");
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -2166,7 +2221,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				{
 					/* Have to write it ourselves */
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
+
+					if (NvwalAvail)
+					{
+						/*
+						 * If we use the non-volatile WAL buffer, writing the
+						 * buffer pages out to segment files is a special but
+						 * expected case, and for simplicity, it is done
+						 * segment by segment.
+						 */
+						XLogRecPtr		OldSegEndPtr;
+
+						OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+						Assert(OldSegEndPtr % wal_segment_size == 0);
+
+						WriteRqst.Write = OldSegEndPtr;
+					}
+					else
+						WriteRqst.Write = OldPageRqstPtr;
+
 					WriteRqst.Flush = 0;
 					XLogWrite(WriteRqst, false);
 					LWLockRelease(WALWriteLock);
@@ -2193,7 +2266,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
 		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+		if (NvwalAvail)
+		{
+			/*
+			 * We do not combine MemSet() with pmem_persist() because
+			 * pmem_persist() may use a slow, strongly ordered cache flush
+			 * instruction if a fast, weakly ordered one is not supported.
+			 * Instead, we first zero the buffer with pmem_memset_persist(),
+			 * which can leverage fast non-temporal store instructions, then
+			 * make the header persistent later.
+			 */
+			nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+		}
+		else
+			MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
 
 		/*
 		 * Fill the new page's header
@@ -2225,7 +2311,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+		is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+		if (is_firstpage)
 		{
 			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
 
@@ -2240,7 +2327,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
 		 * holding a lock.
 		 */
-		pg_write_barrier();
+		if (NvwalAvail)
+		{
+			/* Make the header persistent on PMEM */
+			nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+		}
+		else
+			pg_write_barrier();
 
 		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
 
@@ -2250,6 +2343,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	}
 	LWLockRelease(WALBufMappingLock);
 
+	if (NvwalAvail)
+		elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+			 (uint32) (ControlFile->discardedUpTo >> 32),
+			 (uint32) ControlFile->discardedUpTo,
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo);
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
 	{
@@ -2631,6 +2731,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
 
+	/*
+	 * Update discardedUpTo if NVWAL is used.  A new value should not fall
+	 * behind the old one.
+	 */
+	if (NvwalAvail)
+	{
+		Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		if (ControlFile->discardedUpTo < LogwrtResult.Write)
+		{
+			ControlFile->discardedUpTo = LogwrtResult.Write;
+			UpdateControlFile();
+		}
+		LWLockRelease(ControlFileLock);
+	}
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -2835,6 +2952,123 @@ XLogFlush(XLogRecPtr record)
 		return;
 	}
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	FromPos;
+
+		/*
+		 * No page on the NVWAL is to be flushed to segment files.  Instead,
+		 * we wait for all the insertions preceding this one to complete.  We
+		 * will wait for all the records to be persistent on the NVWAL below.
+		 */
+		record = WaitXLogInsertionsToFinish(record);
+
+		/*
+		 * Check if another backend has already done what I am doing.
+		 *
+		 * We can compare something <= XLogCtl->persistentUpTo without
+		 * holding XLogCtl->info_lck spinlock because persistentUpTo is
+		 * monotonically increasing and can be loaded atomically on each
+		 * NVWAL-supported platform (now x64 only).
+		 */
+		FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+		if (record <= FromPos)
+			return;
+
+		/*
+		 * In a very rare case, we have wrapped around the whole NVWAL.  We do
+		 * not need to care about old pages here because they have already
+		 * been evicted to segment files at record insertion.
+		 *
+		 * In such a case, we flush the whole NVWAL.  We also log it as a
+		 * warning because it can be a time-consuming operation.
+		 *
+		 * TODO Advance XLogCtl->persistentUpTo at the end of XLogWrite, and
+		 * we can remove the following first if-block.
+		 */
+		if (record - FromPos > NvwalSize)
+		{
+			elog(WARNING, "flush whole the NVWAL; FromPos %X/%X; record %X/%X",
+				 (uint32) (FromPos >> 32), (uint32) FromPos,
+				 (uint32) (record >> 32), (uint32) record);
+
+			nv_flush(XLogCtl->pages, NvwalSize);
+		}
+		else
+		{
+			char   *frompos;
+			char   *uptopos;
+			size_t	fromoff;
+			size_t	uptooff;
+
+			/*
+			 * Flush each record that is probably not flushed yet.
+			 *
+			 * We have two reasons why we say "probably".  The first is that a
+			 * record copied with non-temporal store instructions has already
+			 * been "flushed" but we cannot distinguish it.  nv_flush is
+			 * harmless to consistency in that case.
+			 *
+			 * The second reason is that the target record might have already
+			 * been evicted to a segment file by now.  Also in this case,
+			 * nv_flush is harmless to consistency.
+			 */
+			uptooff = record % NvwalSize;
+			uptopos = XLogCtl->pages + uptooff;
+			fromoff = FromPos % NvwalSize;
+			frompos = XLogCtl->pages + fromoff;
+
+			/* Handles rotation */
+			if (uptopos <= frompos)
+			{
+				nv_flush(frompos, NvwalSize - fromoff);
+				fromoff = 0;
+				frompos = XLogCtl->pages;
+			}
+
+			nv_flush(frompos, uptooff - fromoff);
+		}
+
+		/*
+		 * To guarantee durability ("D" of ACID), we should satisfy the
+		 * following two for each transaction X:
+		 *
+		 *  (1) All the WAL records inserted by X, including the commit record
+		 *      of X, should persist on NVWAL before the server commits X.
+		 *
+		 *  (2) All the WAL records inserted by any transactions other than
+		 *      X that have a lower LSN than the commit record just inserted
+		 *      by X should persist on NVWAL before the server commits X.
+		 *
+		 * (1) can be satisfied by a store barrier after the commit record of
+		 * X is flushed because each WAL record of X is already flushed at
+		 * the end of its insertion.  (2) can be satisfied by waiting for any
+		 * record insertions that have a lower LSN than the commit record just
+		 * inserted by X, and by a store barrier as well.
+		 *
+		 * Now is the time.  Have a store barrier.
+		 */
+		nv_drain();
+
+		/*
+		 * Remember where the last persistent record is.  A new value should
+		 * not fall behind the old one.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->persistentUpTo < record)
+			XLogCtl->persistentUpTo = record;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * The records up to the returned "record" are now persistent on
+		 * NVWAL.  Now signal walsenders.
+		 */
+		WalSndWakeupRequest();
+		WalSndWakeupProcessRequests();
+
+		return;
+	}
+
 	/* Quick exit if already known flushed */
 	if (record <= LogwrtResult.Flush)
 		return;
@@ -3018,6 +3252,13 @@ XLogBackgroundFlush(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/*
+	 * Quick exit if NVWAL buffer is used and archiving is not active. In this
+	 * case, we need no WAL segment file in pg_wal directory.
+	 */
+	if (NvwalAvail && !XLogArchivingActive())
+		return false;
+
 	/* read LogwrtResult and update local state */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
@@ -3036,6 +3277,18 @@ XLogBackgroundFlush(void)
 		flexible = false;		/* ensure it all gets written */
 	}
 
+	/*
+	 * If NVWAL is used, back off to the last completed segment boundary
+	 * so that buffer pages are written to files segment by segment.  We do
+	 * so only here, after XLogCtl->asyncXactLSN is loaded, because that
+	 * value should be taken into account.
+	 */
+	if (NvwalAvail)
+	{
+		WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+		flexible = false;		/* ensure it all gets written */
+	}
+
 	/*
 	 * If already known flushed, we're done. Just need to check if we are
 	 * holding an open file handle to a logfile that's no longer in use,
@@ -3062,7 +3315,12 @@ XLogBackgroundFlush(void)
 	flushbytes =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
-	if (WalWriterFlushAfter == 0 || lastflush == 0)
+	if (NvwalAvail)
+	{
+		WriteRqst.Flush = WriteRqst.Write;
+		lastflush = now;
+	}
+	else if (WalWriterFlushAfter == 0 || lastflush == 0)
 	{
 		/* first call, or block based limits disabled */
 		WriteRqst.Flush = WriteRqst.Write;
@@ -3121,7 +3379,28 @@ XLogBackgroundFlush(void)
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
 	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+	if (NvwalAvail && max_wal_senders == 0)
+	{
+		XLogRecPtr		upto;
+
+		/*
+		 * If NVWAL is used and there is no walsender, nobody will load
+		 * segments from the buffer.  So let's recycle segments up to {where
+		 * we have requested to write and flush} + NvwalSize.
+		 *
+		 * Note that if NVWAL is used and a walsender may be running, we do
+		 * nothing; we keep the written pages on the buffer so that walsenders
+		 * can load them from the buffer, not from the segment files.  The
+		 * buffer pages are eventually recycled by checkpoint.
+		 */
+		Assert(WriteRqst.Write == WriteRqst.Flush);
+		Assert(WriteRqst.Write % wal_segment_size == 0);
+
+		upto = WriteRqst.Write + NvwalSize;
+		AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+	}
+	else
+		AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
 
 	/*
 	 * If we determined that we need to write data, but somebody else
@@ -3829,6 +4108,43 @@ XLogFileClose(void)
 	ReleaseExternalFD();
 }
 
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+	XLogRecPtr	newupto,
+				InitializedUpTo;
+
+	Assert(NvwalAvail);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	newupto = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	InitializedUpTo = XLogCtl->InitializedUpTo;
+
+	newupto += NvwalSize;
+	Assert(newupto % wal_segment_size == 0);
+
+	if (newupto <= InitializedUpTo)
+		return;
+
+	/*
+	 * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+	 * handles the first argument as the beginning of pages, not the end.
+	 */
+	AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
 /*
  * Preallocate log files beyond the specified log endpoint.
  *
@@ -4124,8 +4440,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
+	 *
+	 * If the NVWAL buffer is used, a log segment file is never recycled
+	 * (that is, we always go into the else block).
 	 */
-	if (wal_recycle &&
+	if (!NvwalAvail && wal_recycle &&
 		endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
@@ -4533,6 +4852,7 @@ InitControlFile(uint64 sysidentifier)
 	memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
 	ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+	ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5365,41 +5685,58 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	if (NvwalAvail)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+		pgstat_report_wait_end();
 
-	/*
-	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
-	 * close the file again in a moment.
-	 */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		nv_drain();
+		pgstat_report_wait_end();
 
-	/* Write the first page with the initial record */
-	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
-	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		/*
+		 * Other WAL state will be initialized in the startup process.
+		 */
 	}
-	pgstat_report_wait_end();
+	else
+	{
+		use_existent = false;
+		openLogFile = XLogFileInit(1, &use_existent, false);
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
-	pgstat_report_wait_end();
+		/*
+		 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+		 * close the file again in a moment.
+		 */
 
-	if (close(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not close bootstrap write-ahead log file: %m")));
+		/* Write the first page with the initial record */
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
+		pgstat_report_wait_end();
 
-	openLogFile = -1;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		if (pg_fsync(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_end();
+
+		if (close(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close bootstrap write-ahead log file: %m")));
+
+		openLogFile = -1;
+	}
 
 	/* Now create pg_control */
 	InitControlFile(sysidentifier);
@@ -5653,41 +5990,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * happens in the middle of a segment, copy data from the last WAL segment
 	 * of the old timeline up to the switch point, to the starting WAL segment
 	 * on the new timeline.
+	 *
+	 * If non-volatile WAL buffer is used, no new segment file is created. Data
+	 * up to the switch point will be copied into NVWAL buffer by StartupXLOG().
 	 */
-	if (endLogSegNo == startLogSegNo)
+	if (!NvwalAvail)
 	{
-		/*
-		 * Make a copy of the file on the new timeline.
-		 *
-		 * Writing WAL isn't allowed yet, so there are no locking
-		 * considerations. But we should be just as tense as XLogFileInit to
-		 * avoid emplacing a bogus file.
-		 */
-		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
-					 XLogSegmentOffset(endOfLog, wal_segment_size));
-	}
-	else
-	{
-		/*
-		 * The switch happened at a segment boundary, so just create the next
-		 * segment on the new timeline.
-		 */
-		bool		use_existent = true;
-		int			fd;
+		if (endLogSegNo == startLogSegNo)
+		{
+			/*
+			 * Make a copy of the file on the new timeline.
+			 *
+			 * Writing WAL isn't allowed yet, so there are no locking
+			 * considerations. But we should be just as tense as XLogFileInit to
+			 * avoid emplacing a bogus file.
+			 */
+			XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+						 XLogSegmentOffset(endOfLog, wal_segment_size));
+		}
+		else
+		{
+			/*
+			 * The switch happened at a segment boundary, so just create the next
+			 * segment on the new timeline.
+			 */
+			bool		use_existent = true;
+			int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+			fd = XLogFileInit(startLogSegNo, &use_existent, true);
 
-		if (close(fd) != 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno = errno;
+			if (close(fd) != 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno = errno;
 
-			XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
-						 wal_segment_size);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not close file \"%s\": %m", xlogfname)));
+				XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+							 wal_segment_size);
+				errno = save_errno;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", xlogfname)));
+			}
 		}
 	}
 
@@ -6919,6 +7262,11 @@ StartupXLOG(void)
 		InRecovery = true;
 	}
 
+	/* Dump discardedUpTo just before REDO */
+	elog(LOG, "ControlFile->discardedUpTo %X/%X",
+		 (uint32) (ControlFile->discardedUpTo >> 32),
+		 (uint32) ControlFile->discardedUpTo);
+
 	/* REDO */
 	if (InRecovery)
 	{
@@ -7691,10 +8039,88 @@ StartupXLOG(void)
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	discardedUpTo;
+
+		discardedUpTo = ControlFile->discardedUpTo;
+		Assert(discardedUpTo == InvalidXLogRecPtr ||
+			   discardedUpTo % wal_segment_size == 0);
+
+		if (discardedUpTo == InvalidXLogRecPtr)
+		{
+			elog(DEBUG1, "brand-new NVWAL");
+
+			/* The following "Tricky point" is to initialize the buffer */
+		}
+		else if (EndOfLog <= discardedUpTo)
+		{
+			elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = InvalidXLogRecPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+
+			nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+			/* The following "Tricky point" is to initialize the buffer */
+		}
+		else
+		{
+			int			last_idx;
+			int			idx;
+			XLogRecPtr	ptr;
+
+			elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+			/*
+			 * Initialize the xlblocks array because we decided to keep UNDONE
+			 * records on the NVWAL buffer; otherwise each page on the buffer
+			 * with xlblocks == 0 (initialized so by XLOGShmemInit) would be
+			 * accidentally cleared by the following AdvanceXLInsertBuffer!
+			 *
+			 * Two cases can be considered:
+			 *
+			 * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+			 *    Initialize up to (and including) the page containing the last
+			 *    record.  That page should end with EndOfLog.  The next
+			 *    page "N" beginning with EndOfLog is left untouched
+			 *    because, in the corner case that all the NVWAL buffer
+			 *    pages are already filled, page N is at the same
+			 *    location as the first page "F" beginning with discardedUpTo.
+			 *    Of course we should not overwrite the page F.
+			 *
+			 *    In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+			 *    last_idx, indicating the page N.  Then, we go forward from
+			 *    the page F up to (but excluding) page N, which in that
+			 *    corner case has the same index as the page F.
+			 *
+			 * 2) EndOfLog is not on a page boundary:  Initialize all the pages
+			 *    but the page "L" having the last record. The page L is to be
+			 *    initialized by the following "Tricky point", including its
+			 *    content.
+			 *
+			 * In either case, XLogCtl->InitializedUpTo is to be initialized in
+			 * the following "Tricky" if-else block.
+			 */
+
+			last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+			ptr = discardedUpTo;
+			for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+				 idx = NextBufIdx(idx))
+			{
+				ptr += XLOG_BLCKSZ;
+				XLogCtl->xlblocks[idx] = ptr;
+			}
+		}
+	}
+
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * Tricky point here: readBuf contains the *last* block that the
+	 * LastRec record spans, not the one it starts in.  The last block is
+	 * indeed the one we want to use.
 	 */
 	if (EndOfLog % XLOG_BLCKSZ != 0)
 	{
@@ -7714,6 +8140,9 @@ StartupXLOG(void)
 		memcpy(page, xlogreader->readBuf, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
+		if (NvwalAvail)
+			nv_persist(page, XLOG_BLCKSZ);
+
 		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
 		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
 	}
@@ -7727,12 +8156,54 @@ StartupXLOG(void)
 		XLogCtl->InitializedUpTo = EndOfLog;
 	}
 
-	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+	if (NvwalAvail)
+	{
+		XLogRecPtr	SegBeginPtr;
 
-	XLogCtl->LogwrtResult = LogwrtResult;
+		/*
+		 * If NVWAL buffer is used, writing records out to segment files should
+		 * be done segment by segment. So Logwrt{Rqst,Result} (and also
+		 * discardedUpTo) should be a multiple of wal_segment_size.  Let's
+		 * back them off to the last segment boundary.
+		 */
 
-	XLogCtl->LogwrtRqst.Write = EndOfLog;
-	XLogCtl->LogwrtRqst.Flush = EndOfLog;
+		SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+		LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+		XLogCtl->LogwrtResult = LogwrtResult;
+		XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+		XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+		/*
+		 * persistentUpTo does not need to be a multiple of wal_segment_size;
+		 * it should be the LSN drained up to. walsender will use it to load
+		 * records from the NVWAL buffer.
+		 */
+		XLogCtl->persistentUpTo = EndOfLog;
+
+		/* Update discardedUpTo in pg_control if still invalid */
+		if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+		{
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = SegBeginPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+		}
+
+		elog(DEBUG1, "EndOfLog: %X/%X",
+			 (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
+
+		elog(DEBUG1, "SegBeginPtr: %X/%X",
+			 (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+	}
+	else
+	{
+		LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		XLogCtl->LogwrtRqst.Write = EndOfLog;
+		XLogCtl->LogwrtRqst.Flush = EndOfLog;
+	}
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7863,6 +8334,7 @@ StartupXLOG(void)
 				char		origpath[MAXPGPATH];
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
+				XLogRecPtr	discardedUpTo;
 
 				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7874,6 +8346,53 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
+				/*
+				 * If NVWAL is also used for archival recovery, write old
+				 * records out to segment files to archive them.  Note that we
+				 * need locks related to WAL because LocalXLogInsertAllowed
+				 * has already become -1.
+				 */
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo < EndOfLog)
+				{
+					XLogwrtRqst WriteRqst;
+					TimeLineID	thisTLI = ThisTimeLineID;
+					XLogRecPtr	SegBeginPtr =
+						EndOfLog - (EndOfLog % wal_segment_size);
+
+					/*
+					 * XXX Assume that all the records have the same TLI.
+					 */
+					ThisTimeLineID = EndOfLogTLI;
+
+					WriteRqst.Write = EndOfLog;
+					WriteRqst.Flush = 0;
+
+					LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+					XLogWrite(WriteRqst, false);
+
+					/*
+					 * Force back-off to the last segment boundary.
+					 */
+					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+					ControlFile->discardedUpTo = SegBeginPtr;
+					UpdateControlFile();
+					LWLockRelease(ControlFileLock);
+
+					LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+					SpinLockAcquire(&XLogCtl->info_lck);
+					XLogCtl->LogwrtResult = LogwrtResult;
+					XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+					XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+					SpinLockRelease(&XLogCtl->info_lck);
+
+					LWLockRelease(WALWriteLock);
+
+					ThisTimeLineID = thisTLI;
+				}
+
 				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
@@ -7883,7 +8402,10 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	if (NvwalAvail)
+		PreallocNonVolatileXlogBuffer();
+	else
+		PreallocXlogFiles(EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8428,10 +8950,24 @@ GetInsertRecPtr(void)
 /*
  * GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
  * position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
  */
 XLogRecPtr
 GetFlushRecPtr(void)
 {
+	if (NvwalAvail)
+	{
+		XLogRecPtr		ret;
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		ret = XLogCtl->persistentUpTo;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		return ret;
+	}
+
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	SpinLockRelease(&XLogCtl->info_lck);
@@ -8731,6 +9267,9 @@ CreateCheckPoint(int flags)
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
+	/* for non-volatile WAL buffer */
+	XLogRecPtr	newDiscardedUpTo = 0;
+
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 	 * issued at a different time.
@@ -9042,6 +9581,22 @@ CreateCheckPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	/*
+	 * If non-volatile WAL buffer is used, discardedUpTo should be updated and
+	 * persisted in the control file. So the new value should be calculated
+	 * here.
+	 *
+	 * TODO Do not copy and paste code...
+	 */
+	if (NvwalAvail)
+	{
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+
+		newDiscardedUpTo = _logSegNo * wal_segment_size;
+	}
+
 	/*
 	 * Update the control file.
 	 */
@@ -9050,6 +9605,16 @@ CreateCheckPoint(int flags)
 		ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
+	if (NvwalAvail)
+	{
+		/*
+		 * A new value should not fall behind the old one.
+		 */
+		if (ControlFile->discardedUpTo < newDiscardedUpTo)
+			ControlFile->discardedUpTo = newDiscardedUpTo;
+		else
+			newDiscardedUpTo = ControlFile->discardedUpTo;
+	}
 	ControlFile->time = (pg_time_t) time(NULL);
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9067,6 +9632,44 @@ CreateCheckPoint(int flags)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+	 * so that the XLOG records older than newDiscardedUpTo are treated as
+	 * "already written and flushed."
+	 */
+	if (NvwalAvail)
+	{
+		Assert(newDiscardedUpTo > 0);
+
+		/* Update process-local variables */
+		LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+		/*
+		 * Update shared-memory variables. We need both light-weight lock and
+		 * spin lock to update them.
+		 */
+		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&XLogCtl->info_lck);
+
+		/*
+		 * Note that there is a corner case in which the process-local
+		 * LogwrtResult falls behind shared XLogCtl->LogwrtResult: the whole
+		 * non-volatile XLOG buffer fills up and some pages are written out
+		 * to segment files between UpdateControlFile and LWLockAcquire above.
+		 *
+		 * TODO For now, we ignore that case because it can hardly occur.
+		 */
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+		if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteLock);
+	}
+
 	/* Update shared-memory copy of checkpoint XID/epoch */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
@@ -9090,21 +9693,31 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
-	/*
-	 * Delete old log files, those no longer needed for last checkpoint to
-	 * prevent the disk holding the xlog from growing full.
-	 */
-	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-	KeepLogSeg(recptr, &_logSegNo);
-	_logSegNo--;
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	if (NvwalAvail)
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	else
+	{
+		/*
+		 * Delete old log files, those no longer needed for last checkpoint to
+		 * prevent the disk holding the xlog from growing full.
+		 */
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+	{
+		if (NvwalAvail)
+			PreallocNonVolatileXlogBuffer();
+		else
+			PreallocXlogFiles(recptr);
+	}
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -11751,6 +12364,116 @@ CancelBackup(void)
 	}
 }
 
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+	return NvwalAvail;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets nvwalptr to load-from LSN.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+	XLogRecPtr	readUpTo;
+	XLogRecPtr	discardedUpTo;
+
+	Assert(IsNvwalAvail());
+	Assert(nvwalptr != NULL);
+
+	readUpTo = target + count;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	discardedUpTo = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	/* Check if all the records are on WAL segment files */
+	if (readUpTo <= discardedUpTo)
+		return 0;
+
+	/* Check if all the records are on NVWAL */
+	if (discardedUpTo <= target)
+	{
+		*nvwalptr = target;
+		return count;
+	}
+
+	/* Some on WAL segment files, some on NVWAL */
+	*nvwalptr = discardedUpTo;
+	return (Size) (readUpTo - discardedUpTo);
+}
+
+/*
+ * It is like WALRead @ xlogreader.c, but loads from non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, XLogRecPtr startptr, Size count)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	Assert(NvwalAvail);
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	/*
+	 * Hold WALBufMappingLock in shared mode to keep others from rotating the
+	 * WAL buffer while we read WAL records from it.  We do not need an
+	 * exclusive lock because this function does not rotate the buffer.
+	 */
+	LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+	while (nbytes > 0)
+	{
+		char	   *src;
+		Size		off;
+		Size		max_read;
+		Size		readbytes;
+		XLogRecPtr	discardedUpTo;
+
+		LWLockAcquire(ControlFileLock, LW_SHARED);
+		discardedUpTo = ControlFile->discardedUpTo;
+		LWLockRelease(ControlFileLock);
+
+		/* Check if the records we need have been already evicted or not */
+		if (recptr < discardedUpTo)
+		{
+			LWLockRelease(WALBufMappingLock);
+
+			/* TODO error handling? */
+			return false;
+		}
+
+		/*
+		 * Get the source address in the non-volatile WAL buffer and the size
+		 * we can load from it at once.  Because the buffer wraps around, we
+		 * might have to load the requested range in two or more pieces.
+		 */
+		off = recptr % NvwalSize;
+		src = XLogCtl->pages + off;
+		max_read = NvwalSize - off;
+		readbytes = (nbytes < max_read) ? nbytes : max_read;
+
+		memcpy(p, src, readbytes);
+
+		/* Update state for load */
+		recptr += readbytes;
+		nbytes -= readbytes;
+		p += readbytes;
+	}
+
+	LWLockRelease(WALBufMappingLock);
+	return true;
+}
+
 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
  * Returns number of bytes read, if the page is read successfully, or -1
@@ -11818,7 +12541,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readSource != XLOG_FROM_NVWAL && readFile < 0) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
@@ -11830,10 +12553,68 @@ retry:
 			if (readFile >= 0)
 				close(readFile);
 			readFile = -1;
-			readLen = 0;
-			readSource = 0;
 
-			return -1;
+			/*
+			 * Try non-volatile WAL buffer as last resort.
+			 *
+			 * XXX It is not supported yet in standby mode.
+			 */
+			if (NvwalAvail && !StandbyMode && readSource != XLOG_FROM_STREAM)
+			{
+				XLogRecPtr	discardedUpTo;
+
+				elog(DEBUG1, "see if NVWAL has records to be UNDONE");
+
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo <= targetPagePtr)
+				{
+					elog(DEBUG1, "recovering NVWAL");
+
+					/* Loading records from non-volatile WAL buffer */
+					currentSource = XLOG_FROM_NVWAL;
+					lastSourceFailed = false;
+
+					/* Report recovery progress in PS display */
+					set_ps_display("recovering NVWAL", false);
+
+					/* Track source of data */
+					readSource = XLOG_FROM_NVWAL;
+					XLogReceiptSource = XLOG_FROM_NVWAL;
+
+					/* Track receipt time */
+					XLogReceiptTime = GetCurrentTimestamp();
+
+					/*
+					 * Construct expectedTLEs.  This is necessary to recover
+					 * only from NVWAL because its filename does not have any
+					 * TLI information.
+					 */
+					if (!expectedTLEs)
+					{
+						TimeLineHistoryEntry *entry;
+
+						entry = (TimeLineHistoryEntry *) palloc(sizeof(TimeLineHistoryEntry));
+						entry->tli = recoveryTargetTLI;
+						entry->begin = entry->end = InvalidXLogRecPtr;
+
+						expectedTLEs = list_make1(entry);
+
+						elog(DEBUG1, "expectedTLEs: [%u]", (uint32) recoveryTargetTLI);
+					}
+				}
+			}
+			else
+				elog(DEBUG1, "do not recover NVWAL");
+
+			/* See if the try above succeeded or not */
+			if (readSource != XLOG_FROM_NVWAL)
+			{
+				readLen = 0;
+				readSource = 0;
+
+				return -1;
+			}
 		}
 	}
 
@@ -11841,7 +12622,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || readSource == XLOG_FROM_NVWAL);
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -11860,41 +12641,60 @@ retry:
 	else
 		readLen = XLOG_BLCKSZ;
 
-	/* Read the requested page */
 	readOff = targetPageOff;
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (currentSource == XLOG_FROM_NVWAL)
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		Size		offset = (Size) (targetPagePtr % NvwalSize);
+		char	   *readpos = XLogCtl->pages + offset;
 
+		Assert(readLen == XLOG_BLCKSZ);
+		Assert(offset % XLOG_BLCKSZ == 0);
+
+		/* Load the requested page from non-volatile WAL buffer */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		memcpy(readBuf, readpos, readLen);
 		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+
+		/* There are not any other clues of TLI... */
+		xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+	}
+	else
+	{
+		/* Read the requested page from file */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
+		pgstat_report_wait_end();
+
+		xlogreader->seg.ws_tli = curFileTLI;
 	}
-	pgstat_report_wait_end();
 
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->seg.ws_tli = curFileTLI;
-
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
 	 * it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -11928,6 +12728,17 @@ retry:
 		goto next_record_is_invalid;
 	}
 
+	/*
+	 * Update curFileTLI on each verified page if non-volatile WAL buffer is
+	 * used, because there is no TimeLineID information in NVWAL's filename.
+	 */
+	if (readSource == XLOG_FROM_NVWAL &&
+		curFileTLI != xlogreader->latestPageTLI)
+	{
+		curFileTLI = xlogreader->latestPageTLI;
+		elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+	}
+
 	return readLen;
 
 next_record_is_invalid:
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 32f02256ed..c40a4f1400 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1058,11 +1058,24 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	XLogRecPtr	recptr_nvwal = 0;
+	Size		nbytes_nvwal = 0;
+#endif
 
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
 
+#ifndef FRONTEND
+	/* Try to load records directly from NVWAL if used */
+	if (IsNvwalAvail())
+	{
+		nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+		nbytes = count - nbytes_nvwal;
+	}
+#endif
+
 	while (nbytes > 0)
 	{
 		uint32		startoff;
@@ -1127,6 +1140,17 @@ WALRead(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
 		p += readbytes;
 	}
 
+#ifndef FRONTEND
+	if (IsNvwalAvail())
+	{
+		if (!CopyXLogRecordsFromNVWAL(p, recptr_nvwal, nbytes_nvwal))
+		{
+			/* TODO graceful error handling */
+			elog(PANIC, "some records on NVWAL had been discarded");
+		}
+	}
+#endif
+
 	return true;
 }
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..4c594e915f 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("discarded Up To:                      %X/%X\n"),
+		   (uint32) (ControlFile->discardedUpTo >> 32),
+		   (uint32) ControlFile->discardedUpTo);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 174423901a..ccf2671bd9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -324,6 +324,14 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern bool IsNvwalAvail(void);
+extern Size GetLoadableSizeFromNvwal(XLogRecPtr target,
+									 Size count,
+									 XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+									 XLogRecPtr startptr,
+									 Size count);
+
 /*
  * Routines to start, stop, and get status of a base backup.
  */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..fe71992a69 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1301
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
 
 	XLogRecPtr	unloggedLSN;	/* current fake LSN value, for unlogged rels */
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+	 * checkpoint or a restartpoint is completed successfully, or the whole
+	 * NVWAL is filled with WAL records and a new record is being inserted.
+	 * This field tells that the NVWAL contains WAL records in the range of
+	 * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+	 * Note that the WAL records whose LSNs are less than discardedUpTo would
+	 * remain in WAL segment files and be needed for recovery.
+	 *
+	 * It is set to zero when NVWAL is not used.
+	 */
+	XLogRecPtr	discardedUpTo;
+
 	/*
 	 * These two values determine the minimum point we must recover up to
 	 * before starting up:
-- 
2.17.1
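
The read path added by the patch above (GetLoadableSizeFromNvwal and CopyXLogRecordsFromNVWAL) can be sketched outside PostgreSQL as follows. This is only a minimal illustration, not the patch's code: LSN, loadable_from_ring, and copy_from_ring are hypothetical stand-ins, there is no locking, and discarded_up_to is passed in rather than read from the control file. It shows how a requested range is split at discardedUpTo and how a copy is cut into pieces where the buffer wraps around.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t LSN;			/* hypothetical stand-in for XLogRecPtr */

/*
 * Return how many bytes of [target, target + count) are still in the ring
 * buffer, given that everything below discarded_up_to now lives only in
 * segment files.  On a non-zero result, *ring_start is the first LSN to
 * read from the ring.
 */
size_t
loadable_from_ring(LSN target, size_t count, LSN discarded_up_to,
				   LSN *ring_start)
{
	LSN			end = target + count;

	assert(ring_start != NULL);

	if (end <= discarded_up_to)
		return 0;				/* entirely in segment files */

	if (discarded_up_to <= target)
	{
		*ring_start = target;
		return count;			/* entirely in the ring */
	}

	*ring_start = discarded_up_to;	/* head in files, tail in the ring */
	return (size_t) (end - discarded_up_to);
}

/*
 * Copy count bytes starting at LSN start out of a ring of ring_size bytes.
 * The copy is split wherever the range crosses the end of the ring.
 */
void
copy_from_ring(char *dst, const char *ring, size_t ring_size,
			   LSN start, size_t count)
{
	assert(ring_size > 0);

	while (count > 0)
	{
		size_t		off = (size_t) (start % ring_size);
		size_t		chunk = ring_size - off;

		if (chunk > count)
			chunk = count;

		memcpy(dst, ring + off, chunk);
		dst += chunk;
		start += chunk;
		count -= chunk;
	}
}
```

In this sketch the caller would read the first count - loadable bytes from segment files and the rest from the ring, which mirrors how WALRead is changed in the patch.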

v2-0003-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From 7a886ea7529b4d0e2273a13cd8d9209b652099c4 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:10:44 +0900
Subject: [PATCH v2 3/3] README for non-volatile WAL buffer

---
 README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 README.nvwal

diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By putting the WAL buffer pages on persistent memory (PMEM)
+[1], inserting WAL records into them directly, and eliminating I/O for WAL
+segment files, PostgreSQL gets lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+  * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+    * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+  * Linux: 4.15 or later (tested on 5.2)
+  * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+  * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I recommend installing under your home directory with the --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+  $ ./configure --with-nvwal --prefix="$HOME/postgres"
+  $ make
+  $ make install
+  $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make an ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it on /mnt/pmem0. Please DO NOT forget the "-o dax"
+option on mount. For Intel DCPMM and ipmctl, please see [4].
+
+  $ ndctl list
+  [
+    {
+      "dev":"namespace1.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem1",
+      "numa_node":1
+    },
+    {
+      "dev":"namespace0.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem0",
+      "numa_node":0
+    }
+  ]
+
+  $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+  {
+    "dev":"namespace0.0",
+    "mode":"fsdax",
+    "map":"dev",
+    "size":"94.50 GiB (101.47 GB)",
+    "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+    "sector_size":512,
+    "blockdev":"pmem0",
+    "numa_node":0
+  }
+
+  $ ls -l /dev/pmem0
+  brw-rw---- 1 root disk 259, 3 Jan  6 17:06 /dev/pmem0
+
+  $ sudo mkfs.ext4 -q -F /dev/pmem0
+  $ sudo mkdir -p /mnt/pmem0
+  $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+  $ mount -l | grep ^/dev/pmem0
+  /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge pages
+-----------------------------
+Transparent huge pages are usually not considered suitable for database
+workloads, but they improve PMEM performance by reducing page-walk overhead.
+
+  $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+  -rw-r--r-- 1 root root 4096 Dec  3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+  $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+  $ cat /sys/kernel/mm/transparent_hugepage/enabled
+  [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+  -P, --nvwal-path=FILE  path to file for non-volatile WAL buffer (NVWAL)
+  -Q, --nvwal-size=SIZE  size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+  $ sudo mkdir -p /mnt/pmem0/pgsql
+  $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+  $ export PGDATA="$HOME/pgdata"
+  $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is okay; your NVWAL file has the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+  segment size. The segment size is given with initdb --wal-segsize and
+  defaults to 16MB.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+  which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+  above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+  exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+  not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please decide carefully how
+  large your NVWAL file should be.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters, nvwal_path and nvwal_size, corresponding to the two
+new options of initdb. If you run initdb as above, postgresql.conf in your
+PGDATA directory will contain settings like the following:
+
+  max_wal_size = 80GB
+  min_wal_size = 80GB
+  nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+  nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+  actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+  forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres could possibly run even if the three values are not
+  the same; however, we have not tested such a case yet.
+
+
+Startup
+-------
+Start the server as usual:
+
+  $ pg_ctl start
+
+or use numactl as follows to let postgres run on the specified NUMA node
+(typically the one on which your NVWAL file is) if you need stable performance:
+
+  $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
-- 
2.17.1

nvwal-performance-s50.png (image/png)
nvwal-performance-s1000.png (image/png)
postgresql.conf (application/octet-stream)
#18Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Andres Freund (#15)
8 attachment(s)
RE: [PoC] Non-volatile WAL buffer

Dear Andres,

Thank you for your advice about the MAP_POPULATE flag. I rebased my msync patchset onto master and added a commit that passes that flag
to mmap(). A new v2 patchset is attached to this mail. Note that this patchset is NOT the non-volatile WAL buffer patchset.

I also measured performance of the following three versions, varying -c/--client and -j/--jobs options of pgbench, for each scaling
factor s = 50 or 1000.

- Before patchset (say "before")
- After the patchset except patch 0005, that is, without MAP_POPULATE ("after (no populate)")
- After the full patchset, that is, with MAP_POPULATE ("after (populate)")

The results are presented in the following tables and the attached charts. Conditions, steps, and other details are shown
below. Note that, unlike the measurement of the non-volatile WAL buffer I sent recently [1], I used an NVMe SSD for pg_wal to evaluate
this patchset with traditional mmap-ed files, that is, direct access (DAX) is not supported and the page cache is involved.

Results (s=50)
==============
Throughput [10^3 TPS]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  --------------------------------------
( 8, 8)  30.9    28.1 (- 9.2%)   28.3 (- 8.6%)
(18,18)  61.5    46.1 (-25.0%)   47.7 (-22.3%)
(36,18)  67.0    45.9 (-31.5%)   48.4 (-27.8%)
(54,18)  68.3    47.0 (-31.3%)   49.6 (-27.5%)

Average latency [ms]
( c, j)  before  after            after
                 (no populate)    (populate)
-------  ----------------------------------------
( 8, 8)  0.259   0.285 (+10.0%)   0.283 (+ 9.3%)
(18,18)  0.293   0.391 (+33.4%)   0.377 (+28.7%)
(36,18)  0.537   0.784 (+46.0%)   0.744 (+38.5%)
(54,18)  0.790   1.149 (+45.4%)   1.090 (+38.0%)

Results (s=1000)
================
Throughput [10^3 TPS]
( c, j)  before  after           after
                 (no populate)   (populate)
-------  --------------------------------------
( 8, 8)  32.0    29.6 (- 7.6%)   29.1 (- 9.0%)
(18,18)  66.1    49.2 (-25.6%)   50.4 (-23.7%)
(36,18)  76.4    51.0 (-33.3%)   53.4 (-30.1%)
(54,18)  80.1    54.3 (-32.2%)   57.2 (-28.6%)

Average latency [ms]
( c, j)  before  after            after
                 (no populate)    (populate)
-------  ----------------------------------------
( 8, 8)  0.250   0.271 (+ 8.4%)   0.275 (+10.0%)
(18,18)  0.272   0.366 (+34.6%)   0.357 (+31.3%)
(36,18)  0.471   0.706 (+49.9%)   0.674 (+43.1%)
(54,18)  0.674   0.995 (+47.6%)   0.944 (+40.1%)
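
The percentages in parentheses appear to be the relative change of each "after" value against "before". A minimal sketch of that arithmetic follows; relative_change_pct is a hypothetical helper name, and the formula is an assumption inferred from the numbers in the tables, not something stated in this mail.

```c
#include <assert.h>

/*
 * Relative change of `after` against `before`, in percent.
 * Negative means a decrease, positive an increase.
 */
double
relative_change_pct(double before, double after)
{
	assert(before != 0.0);		/* the baseline must be non-zero */

	return (after - before) / before * 100.0;
}
```

For the (18,18) row at s=50, relative_change_pct(61.5, 46.1) gives about -25.0 for throughput and relative_change_pct(0.293, 0.391) gives about +33.4 for latency. Some rows differ from the tables by about 0.1 point when recomputed this way, presumably because the tables were derived from unrounded measurements.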

I'd say MAP_POPULATE made performance a little better with many clients, comparing "populate" with "no populate". However,
comparing "after" with "before", I found that both throughput and average latency degraded. VTune told me that "after (populate)" still
spent more CPU time memcpy-ing WAL records into mmap-ed segments than "before".

I also made a microbenchmark to see the behavior of mmap and msync. I found that:

- A major fault occurred at mmap with MAP_POPULATE, instead of at the first access to the mmap-ed space.
- Some minor faults also occurred at mmap with MAP_POPULATE, and no additional fault occurred when I loaded from the mmap-ed space.
But once I stored to that space, a minor fault occurred.
- When I stored to a page that had been msync-ed, a minor fault occurred.

So I think one of the remaining causes of the performance degradation is minor faults when mmap-ed pages get dirtied. And it seems
that this cannot be solved by MAP_POPULATE alone, as far as I can see.
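
A probe along the lines of the microbenchmark described above can be sketched as follows. This is not the program I used for the observations above; it is a minimal, Linux-only illustration in which probe_mmap_faults, report_faults, and the scratch file path are hypothetical. It maps a file with MAP_SHARED | MAP_POPULATE and prints the page-fault deltas reported by getrusage() after each step: mapping, loading, first store, msync, and a store after msync.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Print the minor/major page faults incurred since the previous call. */
static void
report_faults(const char *label)
{
	static long prev_minor = 0;
	static long prev_major = 0;
	struct rusage ru;

	getrusage(RUSAGE_SELF, &ru);
	printf("%-18s minor=%ld major=%ld\n", label,
		   ru.ru_minflt - prev_minor, ru.ru_majflt - prev_major);
	prev_minor = ru.ru_minflt;
	prev_major = ru.ru_majflt;
}

/* Map a scratch file of len bytes and report faults per step; 0 on success. */
int
probe_mmap_faults(const char *path, size_t len)
{
	volatile unsigned char sink = 0;
	long		pagesz = sysconf(_SC_PAGESIZE);
	char	   *p;
	int			fd;
	int			rc = 0;

	assert(len > 0 && pagesz > 0);

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) len) != 0)
	{
		close(fd);
		unlink(path);
		return -1;
	}
	report_faults("setup");

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED | MAP_POPULATE, fd, 0);
	if (p == MAP_FAILED)
	{
		close(fd);
		unlink(path);
		return -1;
	}
	report_faults("mmap+populate");

	/* Load one byte from every page */
	for (size_t i = 0; i < len; i += (size_t) pagesz)
		sink += (unsigned char) p[i];
	report_faults("load");

	/* Dirty every page for the first time */
	memset(p, 1, len);
	report_faults("first store");

	if (msync(p, len, MS_SYNC) != 0)
		rc = -1;
	report_faults("msync");

	/* Dirty the pages again after they have been written back */
	memset(p, 2, len);
	report_faults("store after msync");

	munmap(p, len);
	close(fd);
	unlink(path);
	return rc;
}
```

Calling, for example, probe_mmap_faults("/tmp/mmap_fault_probe.bin", 16 * 1024 * 1024) prints one line per step. The counts depend on the kernel, the filesystem, and whether the file is already in the page cache, so a freshly created sparse file as used here may not reproduce the major fault mentioned above.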

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use two NVMe SSDs; one for PGDATA, another for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
- Use the attached postgresql.conf

Steps
=====
For each (c,j) pair, I did the following steps three times and took the median of the three runs as the final result shown in the
tables above.

(1) Run initdb with proper -D and -X options
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes
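
Taking the median of three runs, as described at the top of this section, can be sketched as below. median_of_three is a hypothetical helper for illustration, not part of any attached script.

```c
#include <assert.h>

/*
 * Return the median of three measurements, that is, the value that is
 * neither the smallest nor the largest.
 */
double
median_of_three(double a, double b, double c)
{
	/* NaN would break the comparisons below */
	assert(a == a && b == b && c == c);

	if ((a <= b && b <= c) || (c <= b && b <= a))
		return b;
	if ((b <= a && a <= c) || (c <= a && a <= b))
		return a;
	return c;
}
```

Using the median rather than the mean keeps one unusually slow or fast run from shifting the reported value.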

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so pgbench ran its default built-in "TPC-B (sort-of)" script.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA x2

Best regards,
Takashi

[1]: /messages/by-id/002701d5fd03$6e1d97a0$4a58c6e0$@hco.ntt.co.jp_1

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center


-----Original Message-----
From: Andres Freund <andres@anarazel.de>
Sent: Thursday, February 20, 2020 2:04 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;
pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Hi,

On 2020-02-17 13:12:37 +0900, Takashi Menjo wrote:

I applied my patchset that mmap()-s WAL segments as WAL buffers to
refs/tags/REL_12_0, and measured and analyzed its performance with
pgbench. Roughly speaking, When I used *SSD and ext4* to store WAL,
it was "obviously worse" than the original REL_12_0. VTune told me
that the CPU time of memcpy() called by CopyXLogRecordToWAL() got
larger than before.

FWIW, this might largely be because of page faults. In contrast to before we wouldn't reuse the same pages
(because they've been munmap()/mmap()ed), so the first time they're touched, we'll incur page faults. Did you
try mmap()ing with MAP_POPULATE? It's probably also worthwhile to try to use MAP_HUGETLB.

Still doubtful it's the right direction, but I'd rather have good numbers to back me up :)

Greetings,

Andres Freund

Attachments:

v2-0001-Preallocate-more-WAL-segments.patch (application/octet-stream)
From 1afcff4eacdcb8c7d9c5547432d546d16ebef3a2 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:13:59 +0900
Subject: [PATCH v2 1/5] Preallocate more WAL segments

---
 src/backend/access/transam/xlog.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4361568882..b0362dce44 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -894,7 +894,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
-static void PreallocXlogFiles(XLogRecPtr endptr);
+static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -3824,27 +3824,20 @@ XLogFileClose(void)
 
 /*
  * Preallocate log files beyond the specified log endpoint.
- *
- * XXX this is currently extremely conservative, since it forces only one
- * future log segment to exist, and even that only if we are 75% done with
- * the current one.  This is only appropriate for very low-WAL-volume systems.
- * High-volume systems will be OK once they've built up a sufficient set of
- * recycled log segments, but the startup transient is likely to include
- * a lot of segment creations by foreground processes, which is not so good.
  */
 static void
-PreallocXlogFiles(XLogRecPtr endptr)
+PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
+	XLogSegNo	endSegNo;
+	XLogSegNo	recycleSegNo;
 	int			lf;
 	bool		use_existent;
-	uint64		offset;
 
-	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
-	offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
-	if (offset >= (uint32) (0.75 * wal_segment_size))
+	XLByteToPrevSeg(endptr, endSegNo, wal_segment_size);
+	recycleSegNo = XLOGfileslop(RedoRecPtr);
+	for (_logSegNo = endSegNo + 1; _logSegNo <= recycleSegNo; _logSegNo++)
 	{
-		_logSegNo++;
 		use_existent = true;
 		lf = XLogFileInit(_logSegNo, &use_existent, true);
 		close(lf);
@@ -7748,7 +7741,7 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	PreallocXlogFiles(RedoRecPtr, EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8962,7 +8955,7 @@ CreateCheckPoint(int flags)
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+		PreallocXlogFiles(RedoRecPtr, recptr);
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -9312,7 +9305,7 @@ CreateRestartPoint(int flags)
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
-	PreallocXlogFiles(endptr);
+	PreallocXlogFiles(RedoRecPtr, endptr);
 
 	/*
 	 * ThisTimeLineID is normally not set when we're still in recovery.
-- 
2.17.1
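
For readers following the patch above, the range of segments that the reworked PreallocXlogFiles() loop visits can be sketched as plain arithmetic. This is only an illustration, not code from the patchset: `prev_seg_no` and `segments_to_preallocate` are hypothetical helper names, the two typedefs are stand-ins for the PostgreSQL ones, and the result of XLOGfileslop() is taken as a parameter rather than computed.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-ins for the PostgreSQL typedefs (assumption for this sketch). */
typedef uint64_t XLogRecPtr;
typedef uint64_t XLogSegNo;

/*
 * Same arithmetic as XLByteToPrevSeg: the segment that holds byte
 * (ptr - 1), so a pointer exactly on a segment boundary still maps to
 * the segment that has just been filled.
 */
XLogSegNo
prev_seg_no(XLogRecPtr ptr, uint64_t seg_size)
{
	assert(ptr > 0 && seg_size > 0);
	return (ptr - 1) / seg_size;
}

/*
 * Number of segments the patched loop would visit:
 * endSegNo + 1 .. recycleSegNo, inclusive.  recycle_seg_no stands for
 * the value returned by XLOGfileslop(RedoRecPtr), which is not
 * modelled here.
 */
uint64_t
segments_to_preallocate(XLogRecPtr endptr, XLogSegNo recycle_seg_no,
						uint64_t seg_size)
{
	XLogSegNo	end_seg_no = prev_seg_no(endptr, seg_size);

	return (recycle_seg_no > end_seg_no) ? recycle_seg_no - end_seg_no : 0;
}
```

With 16 MiB segments, an end pointer exactly at the end of segment 0 and a recycle target of segment 3 give three segments (1, 2 and 3). The old code would have created at most one.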

Attachment: v2-0002-Use-WAL-segments-as-WAL-buffers.patch (application/octet-stream)
From a228fe4588a65494b3ae2b3295461defbba55a71 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:00 +0900
Subject: [PATCH v2 2/5] Use WAL segments as WAL buffers

Please run ./configure with LIBS=-lpmem to build.

Note that wal_sync_method is ignored from this patch onward.
---
 src/backend/access/transam/xlog.c | 967 +++++++++++-------------------
 1 file changed, 366 insertions(+), 601 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b0362dce44..423eb839b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -18,9 +18,11 @@
 #include <math.h>
 #include <time.h>
 #include <fcntl.h>
+#include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
+#include <libpmem.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
@@ -613,24 +615,8 @@ typedef struct XLogCtlData
 	XLogwrtResult LogwrtResult;
 
 	/*
-	 * Latest initialized page in the cache (last byte position + 1).
-	 *
-	 * To change the identity of a buffer (and InitializedUpTo), you need to
-	 * hold WALBufMappingLock.  To change the identity of a buffer that's
-	 * still dirty, the old page needs to be written out first, and for that
-	 * you need WALWriteLock, and you need to ensure that there are no
-	 * in-progress insertions to the page by calling
-	 * WaitXLogInsertionsToFinish().
+	 * This value does not change after startup.
 	 */
-	XLogRecPtr	InitializedUpTo;
-
-	/*
-	 * These values do not change after startup, although the pointed-to pages
-	 * and xlblocks values certainly do.  xlblocks values are protected by
-	 * WALBufMappingLock.
-	 */
-	char	   *pages;			/* buffers for unwritten XLOG pages */
-	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
 	/*
@@ -776,9 +762,26 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  * openLogSegNo identifies the segment.  These variables are only used to
  * write the XLOG, and so will normally refer to the active segment.
  * Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
+ *
+ * mappedPages is the mmap(2)-ed address of the open log file segment.
+ * It is used as the WAL buffer instead of XLogCtl->pages.
+ *
+ * pmemMapped is true if mappedPages is on PMEM.
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static char *mappedPages = NULL;
+static bool pmemMapped = false;
+
+/* 2MiB hugepage mask used by XLogFileMapHint */
+#define PG_HUGEPAGE_MASK ((((uintptr_t) 1) << 21) - 1)
+
+#ifndef MAP_SHARED_VALIDATE
+#define MAP_SHARED_VALIDATE 0x3
+#endif
+#ifndef MAP_SYNC
+#define MAP_SYNC 0x80000
+#endif
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -879,12 +882,15 @@ static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
-static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 								   bool find_free, XLogSegNo max_segno,
 								   bool use_lock);
+static void *XLogFileMapHint(void);
+static void *XLogFileMapUtil(void *hint, int fd, bool dax);
+static char *XLogFileMap(XLogSegNo segno, bool *is_pmem);
+static void XLogFileUnmap(char *pages, XLogSegNo segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 int source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
@@ -944,7 +950,6 @@ static void checkXLogConsistency(XLogReaderState *record);
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
-static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -1579,27 +1584,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		 */
 		while (CurrPos < EndPos)
 		{
-			/*
-			 * The minimal action to flush the page would be to call
-			 * WALInsertLockUpdateInsertingAt(CurrPos) followed by
-			 * AdvanceXLInsertBuffer(...).  The page would be left initialized
-			 * mostly to zeros, except for the page header (always the short
-			 * variant, as this is never a segment's first page).
-			 *
-			 * The large vistas of zeros are good for compressibility, but the
-			 * headers interrupting them every XLOG_BLCKSZ (with values that
-			 * differ from page to page) are not.  The effect varies with
-			 * compression tool, but bzip2 for instance compresses about an
-			 * order of magnitude worse if those headers are left in place.
-			 *
-			 * Rather than complicating AdvanceXLInsertBuffer itself (which is
-			 * called in heavily-loaded circumstances as well as this lightly-
-			 * loaded one) with variant behavior, we just use GetXLogBuffer
-			 * (which itself calls the two methods we need) to get the pointer
-			 * and zero most of the page.  Then we just zero the page header.
-			 */
-			currpos = GetXLogBuffer(CurrPos);
-			MemSet(currpos, 0, SizeOfXLogShortPHD);
+			/* XXX We assume that XLogFileInit does what we did here */
 
 			CurrPos += XLOG_BLCKSZ;
 		}
@@ -1713,29 +1698,6 @@ WALInsertLockRelease(void)
 	}
 }
 
-/*
- * Update our insertingAt value, to let others know that we've finished
- * inserting up to that point.
- */
-static void
-WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
-{
-	if (holdingAllLocks)
-	{
-		/*
-		 * We use the last lock to mark our actual position, see comments in
-		 * WALInsertLockAcquireExclusive.
-		 */
-		LWLockUpdateVar(&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.lock,
-						&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.insertingAt,
-						insertingAt);
-	}
-	else
-		LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
-						&WALInsertLocks[MyLockNo].l.insertingAt,
-						insertingAt);
-}
-
 /*
  * Wait for any WAL insertions < upto to finish.
  *
@@ -1836,123 +1798,37 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 /*
  * Get a pointer to the right location in the WAL buffer containing the
  * given XLogRecPtr.
- *
- * If the page is not initialized yet, it is initialized. That might require
- * evicting an old dirty buffer from the buffer cache, which means I/O.
- *
- * The caller must ensure that the page containing the requested location
- * isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto a WAL insertion lock with the insertingAt position set to
- * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
- * to evict an old page from the buffer. (This means that once you call
- * GetXLogBuffer() with a given 'ptr', you must not access anything before
- * that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
- * later, because older buffers might be recycled already)
  */
 static char *
 GetXLogBuffer(XLogRecPtr ptr)
 {
-	int			idx;
-	XLogRecPtr	endptr;
-	static uint64 cachedPage = 0;
-	static char *cachedPos = NULL;
-	XLogRecPtr	expectedEndPtr;
+	int				idx;
+	XLogPageHeader	page;
+	XLogSegNo		segno;
 
-	/*
-	 * Fast path for the common case that we need to access again the same
-	 * page as last time.
-	 */
-	if (ptr / XLOG_BLCKSZ == cachedPage)
+	/* keep the compiler quiet when built without --enable-cassert */
+	(void) page;
+
+	XLByteToSeg(ptr, segno, wal_segment_size);
+	if (segno != openLogSegNo)
 	{
-		Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
-		Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-		return cachedPos + ptr % XLOG_BLCKSZ;
+		/* Unmap the current segment if mapped */
+		if (mappedPages != NULL)
+			XLogFileUnmap(mappedPages, openLogSegNo);
+
+		/* Map the segment we need */
+		mappedPages = XLogFileMap(segno, &pmemMapped);
+		Assert(mappedPages != NULL);
+		openLogSegNo = segno;
 	}
 
-	/*
-	 * The XLog buffer cache is organized so that a page is always loaded to a
-	 * particular buffer.  That way we can easily calculate the buffer a given
-	 * page must be loaded into, from the XLogRecPtr alone.
-	 */
 	idx = XLogRecPtrToBufIdx(ptr);
+	page = (XLogPageHeader) (mappedPages + idx * (Size) XLOG_BLCKSZ);
 
-	/*
-	 * See what page is loaded in the buffer at the moment. It could be the
-	 * page we're looking for, or something older. It can't be anything newer
-	 * - that would imply the page we're looking for has already been written
-	 * out to disk and evicted, and the caller is responsible for making sure
-	 * that doesn't happen.
-	 *
-	 * However, we don't hold a lock while we read the value. If someone has
-	 * just initialized the page, it's possible that we get a "torn read" of
-	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
-	 * that case we will see a bogus value. That's ok, we'll grab the mapping
-	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
-	 * the page we're looking for. But it means that when we do this unlocked
-	 * read, we might see a value that appears to be ahead of the page we're
-	 * looking for. Don't PANIC on that, until we've verified the value while
-	 * holding the lock.
-	 */
-	expectedEndPtr = ptr;
-	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+	Assert(page->xlp_magic == XLOG_PAGE_MAGIC);
+	Assert(page->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
 
-	endptr = XLogCtl->xlblocks[idx];
-	if (expectedEndPtr != endptr)
-	{
-		XLogRecPtr	initializedUpto;
-
-		/*
-		 * Before calling AdvanceXLInsertBuffer(), which can block, let others
-		 * know how far we're finished with inserting the record.
-		 *
-		 * NB: If 'ptr' points to just after the page header, advertise a
-		 * position at the beginning of the page rather than 'ptr' itself. If
-		 * there are no other insertions running, someone might try to flush
-		 * up to our advertised location. If we advertised a position after
-		 * the page header, someone might try to flush the page header, even
-		 * though page might actually not be initialized yet. As the first
-		 * inserter on the page, we are effectively responsible for making
-		 * sure that it's initialized, before we let insertingAt to move past
-		 * the page header.
-		 */
-		if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
-			XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogShortPHD;
-		else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
-				 XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogLongPHD;
-		else
-			initializedUpto = ptr;
-
-		WALInsertLockUpdateInsertingAt(initializedUpto);
-
-		AdvanceXLInsertBuffer(ptr, false);
-		endptr = XLogCtl->xlblocks[idx];
-
-		if (expectedEndPtr != endptr)
-			elog(PANIC, "could not find WAL buffer for %X/%X",
-				 (uint32) (ptr >> 32), (uint32) ptr);
-	}
-	else
-	{
-		/*
-		 * Make sure the initialization of the page is visible to us, and
-		 * won't arrive later to overwrite the WAL data we write on the page.
-		 */
-		pg_memory_barrier();
-	}
-
-	/*
-	 * Found the buffer holding this page. Return a pointer to the right
-	 * offset within the page.
-	 */
-	cachedPage = ptr / XLOG_BLCKSZ;
-	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
-
-	Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
-	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-
-	return cachedPos + ptr % XLOG_BLCKSZ;
+	return mappedPages + ptr % wal_segment_size;
 }
 
 /*
@@ -2080,178 +1956,6 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
 	return result;
 }
 
-/*
- * Initialize XLOG buffers, writing out old buffers if they still contain
- * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
- * true, initialize as many pages as we can without having to write out
- * unwritten data. Any new pages are initialized to zeros, with pages headers
- * initialized properly.
- */
-static void
-AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
-{
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	int			nextidx;
-	XLogRecPtr	OldPageRqstPtr;
-	XLogwrtRqst WriteRqst;
-	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
-	XLogRecPtr	NewPageBeginPtr;
-	XLogPageHeader NewPage;
-	int			npages = 0;
-
-	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
-	/*
-	 * Now that we have the lock, check if someone initialized the page
-	 * already.
-	 */
-	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
-	{
-		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
-
-		/*
-		 * Get ending-offset of the buffer page we need to replace (this may
-		 * be zero if the buffer hasn't been used yet).  Fall through if it's
-		 * already written out.
-		 */
-		OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
-		if (LogwrtResult.Write < OldPageRqstPtr)
-		{
-			/*
-			 * Nope, got work to do. If we just want to pre-initialize as much
-			 * as we can without flushing, give up now.
-			 */
-			if (opportunistic)
-				break;
-
-			/* Before waiting, get info_lck and update LogwrtResult */
-			SpinLockAcquire(&XLogCtl->info_lck);
-			if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
-				XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
-			LogwrtResult = XLogCtl->LogwrtResult;
-			SpinLockRelease(&XLogCtl->info_lck);
-
-			/*
-			 * Now that we have an up-to-date LogwrtResult value, see if we
-			 * still need to write it or if someone else already did.
-			 */
-			if (LogwrtResult.Write < OldPageRqstPtr)
-			{
-				/*
-				 * Must acquire write lock. Release WALBufMappingLock first,
-				 * to make sure that all insertions that we need to wait for
-				 * can finish (up to this same position). Otherwise we risk
-				 * deadlock.
-				 */
-				LWLockRelease(WALBufMappingLock);
-
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
-
-				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
-				LogwrtResult = XLogCtl->LogwrtResult;
-				if (LogwrtResult.Write >= OldPageRqstPtr)
-				{
-					/* OK, someone wrote it already */
-					LWLockRelease(WALWriteLock);
-				}
-				else
-				{
-					/* Have to write it ourselves */
-					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
-					WriteRqst.Flush = 0;
-					XLogWrite(WriteRqst, false);
-					LWLockRelease(WALWriteLock);
-					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
-				}
-				/* Re-acquire WALBufMappingLock and retry */
-				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-				continue;
-			}
-		}
-
-		/*
-		 * Now the next buffer slot is free and we can set it up to be the
-		 * next output page.
-		 */
-		NewPageBeginPtr = XLogCtl->InitializedUpTo;
-		NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-
-		Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
-
-		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
-
-		/*
-		 * Be sure to re-zero the buffer so that bytes beyond what we've
-		 * written will look like zeroes and not valid XLOG records...
-		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
-
-		/*
-		 * Fill the new page's header
-		 */
-		NewPage->xlp_magic = XLOG_PAGE_MAGIC;
-
-		/* NewPage->xlp_info = 0; */	/* done by memset */
-		NewPage->xlp_tli = ThisTimeLineID;
-		NewPage->xlp_pageaddr = NewPageBeginPtr;
-
-		/* NewPage->xlp_rem_len = 0; */	/* done by memset */
-
-		/*
-		 * If online backup is not in progress, mark the header to indicate
-		 * that WAL records beginning in this page have removable backup
-		 * blocks.  This allows the WAL archiver to know whether it is safe to
-		 * compress archived WAL data by transforming full-block records into
-		 * the non-full-block format.  It is sufficient to record this at the
-		 * page level because we force a page switch (in fact a segment
-		 * switch) when starting a backup, so the flag will be off before any
-		 * records can be written during the backup.  At the end of a backup,
-		 * the last page will be marked as all unsafe when perhaps only part
-		 * is unsafe, but at worst the archiver would miss the opportunity to
-		 * compress a few records.
-		 */
-		if (!Insert->forcePageWrites)
-			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
-
-		/*
-		 * If first page of an XLOG segment file, make it a long header.
-		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
-		{
-			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
-
-			NewLongPage->xlp_sysid = ControlFile->system_identifier;
-			NewLongPage->xlp_seg_size = wal_segment_size;
-			NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
-			NewPage->xlp_info |= XLP_LONG_HEADER;
-		}
-
-		/*
-		 * Make sure the initialization of the page becomes visible to others
-		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
-		 * holding a lock.
-		 */
-		pg_write_barrier();
-
-		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
-		XLogCtl->InitializedUpTo = NewPageEndPtr;
-
-		npages++;
-	}
-	LWLockRelease(WALBufMappingLock);
-
-#ifdef WAL_DEBUG
-	if (XLOG_DEBUG && npages > 0)
-	{
-		elog(DEBUG1, "initialized %d pages, up to %X/%X",
-			 npages, (uint32) (NewPageEndPtr >> 32), (uint32) NewPageEndPtr);
-	}
-#endif
-}
-
 /*
  * Calculate CheckPointSegments based on max_wal_size_mb and
  * checkpoint_completion_target.
@@ -2380,14 +2084,9 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 static void
 XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 {
-	bool		ispartialpage;
-	bool		last_iteration;
 	bool		finishing_seg;
-	bool		use_existent;
-	int			curridx;
-	int			npages;
-	int			startidx;
-	uint32		startoffset;
+	XLogSegNo	rqstLogSegNo;
+	XLogSegNo	segno;
 
 	/* We should always be inside a critical section here */
 	Assert(CritSectionCount > 0);
@@ -2397,233 +2096,149 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	 */
 	LogwrtResult = XLogCtl->LogwrtResult;
 
-	/*
-	 * Since successive pages in the xlog cache are consecutively allocated,
-	 * we can usually gather multiple pages together and issue just one
-	 * write() call.  npages is the number of pages we have determined can be
-	 * written together; startidx is the cache block index of the first one,
-	 * and startoffset is the file offset at which it should go. The latter
-	 * two variables are only valid when npages > 0, but we must initialize
-	 * all of them to keep the compiler quiet.
-	 */
-	npages = 0;
-	startidx = 0;
-	startoffset = 0;
+	/* Fast return if not requested to flush */
+	if (WriteRqst.Flush == 0)
+		return;
+	Assert(WriteRqst.Flush == WriteRqst.Write);
 
 	/*
-	 * Within the loop, curridx is the cache block index of the page to
-	 * consider writing.  Begin at the buffer containing the next unwritten
-	 * page, or last partially written page.
+	 * Call pmem_persist() or pmem_msync() for each segment file that contains
+	 * records to be flushed.
 	 */
-	curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);
-
-	while (LogwrtResult.Write < WriteRqst.Write)
+	XLByteToPrevSeg(WriteRqst.Flush, rqstLogSegNo, wal_segment_size);
+	XLByteToSeg(LogwrtResult.Flush, segno, wal_segment_size);
+	while (segno <= rqstLogSegNo)
 	{
-		/*
-		 * Make sure we're not ahead of the insert process.  This could happen
-		 * if we're passed a bogus WriteRqst.Write that is past the end of the
-		 * last page that's been initialized by AdvanceXLInsertBuffer.
-		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		bool		is_pmem;
+		char	   *addr;
+		char	   *p;
+		Size		len;
+		XLogRecPtr	BeginPtr;
+		XLogRecPtr	EndPtr;
 
-		if (LogwrtResult.Write >= EndPtr)
-			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
-				 (uint32) (LogwrtResult.Write >> 32),
-				 (uint32) LogwrtResult.Write,
-				 (uint32) (EndPtr >> 32), (uint32) EndPtr);
-
-		/* Advance LogwrtResult.Write to end of current buffer page */
-		LogwrtResult.Write = EndPtr;
-		ispartialpage = WriteRqst.Write < LogwrtResult.Write;
-
-		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
-							 wal_segment_size))
+		/* Check if the segment is not mapped yet */
+		if (segno != openLogSegNo)
 		{
+			/* Map the segment */
+			is_pmem = false;
+			addr = XLogFileMap(segno, &is_pmem);
+
 			/*
-			 * Switch to new logfile segment.  We cannot have any pending
-			 * pages here (since we dump what we have at segment end).
+			 * Keep this mapping as this process's WAL buffer for later
+			 * use.  Note that it might be unmapped within this loop.
 			 */
-			Assert(npages == 0);
-			if (openLogFile >= 0)
-				XLogFileClose();
-			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-							wal_segment_size);
-
-			/* create/use new log file */
-			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
-			ReserveExternalFD();
+			if (openLogSegNo == 0)
+			{
+				pmemMapped = is_pmem;
+				mappedPages = addr;
+				openLogSegNo = segno;
+			}
 		}
-
-		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		else
 		{
-			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
-			ReserveExternalFD();
+			/* Otherwise, use the existing mapping */
+			is_pmem = pmemMapped;
+			addr = mappedPages;
 		}
+		Assert(addr != NULL);
+		Assert(mappedPages != NULL);
+		Assert(openLogSegNo > 0);
 
-		/* Add current page to the set of pending pages-to-dump */
-		if (npages == 0)
-		{
-			/* first of group */
-			startidx = curridx;
-			startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
-											wal_segment_size);
-		}
-		npages++;
+		/* Find beginning position to be flushed */
+		BeginPtr = segno * wal_segment_size;
+		if (BeginPtr < LogwrtResult.Flush)
+			BeginPtr = LogwrtResult.Flush;
+
+		/* Find ending position to be flushed */
+		EndPtr = (segno + 1) * wal_segment_size;
+		if (EndPtr > WriteRqst.Flush)
+			EndPtr = WriteRqst.Flush;
+
+		/* Convert LSN to memory address */
+		Assert(BeginPtr <= EndPtr);
+		p = addr + BeginPtr % wal_segment_size;
+		len = (Size) (EndPtr - BeginPtr);
 
 		/*
-		 * Dump the set if this will be the last loop iteration, or if we are
-		 * at the last page of the cache area (since the next page won't be
-		 * contiguous in memory), or if we are at the end of the logfile
-		 * segment.
+		 * Do cache-flush or msync.
+		 *
+		 * Note that pmem_msync() rounds the address down to a page boundary.
 		 */
-		last_iteration = WriteRqst.Write <= LogwrtResult.Write;
-
-		finishing_seg = !ispartialpage &&
-			(startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;
-
-		if (last_iteration ||
-			curridx == XLogCtl->XLogCacheBlck ||
-			finishing_seg)
+		if (is_pmem)
 		{
-			char	   *from;
-			Size		nbytes;
-			Size		nleft;
-			int			written;
-
-			/* OK to write the page(s) */
-			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
-			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
+			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+			pmem_persist(p, len);
+			pgstat_report_wait_end();
+		}
+		else
+		{
+			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+			if (pmem_msync(p, len))
 			{
-				errno = 0;
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno;
+
 				pgstat_report_wait_end();
-				if (written <= 0)
-				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
 
-					if (errno == EINTR)
-						continue;
+				save_errno = errno;
+				XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+							 wal_segment_size);
+				errno = save_errno;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not msync to log file %s "
+								"at address %p, length %zu: %m",
+								xlogfname, p, len)));
+			}
+			pgstat_report_wait_end();
+		}
+		LogwrtResult.Flush = LogwrtResult.Write = EndPtr;
 
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+		/* Check whether the open segment has been flushed through its end */
+		finishing_seg = (LogwrtResult.Flush % wal_segment_size == 0) &&
+						XLByteInPrevSeg(LogwrtResult.Flush, openLogSegNo,
+										wal_segment_size);
 
-			npages = 0;
-
-			/*
-			 * If we just wrote the whole last page of a logfile segment,
-			 * fsync the segment immediately.  This avoids having to go back
-			 * and re-open prior segments when an fsync request comes along
-			 * later. Doing it here ensures that one and only one backend will
-			 * perform this fsync.
-			 *
-			 * This is also the right place to notify the Archiver that the
-			 * segment is ready to copy to archival storage, and to update the
-			 * timer for archive_timeout, and to signal for a checkpoint if
-			 * too many logfile segments have been used since the last
-			 * checkpoint.
-			 */
+		if (segno != openLogSegNo || finishing_seg)
+		{
+			XLogFileUnmap(addr, segno);
 			if (finishing_seg)
 			{
-				issue_xlog_fsync(openLogFile, openLogSegNo);
-
-				/* signal that we need to wakeup walsenders later */
-				WalSndWakeupRequest();
-
-				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
-
-				if (XLogArchivingActive())
-					XLogArchiveNotifySeg(openLogSegNo);
-
-				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
-
-				/*
-				 * Request a checkpoint if we've consumed too much xlog since
-				 * the last one.  For speed, we first check using the local
-				 * copy of RedoRecPtr, which might be out of date; if it looks
-				 * like a checkpoint is needed, forcibly update RedoRecPtr and
-				 * recheck.
-				 */
-				if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
-				{
-					(void) GetRedoRecPtr();
-					if (XLogCheckpointNeeded(openLogSegNo))
-						RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
-				}
+				Assert(segno == openLogSegNo);
+				mappedPages = NULL;
+				openLogSegNo = 0;
 			}
-		}
 
-		if (ispartialpage)
-		{
-			/* Only asked to write a partial page */
-			LogwrtResult.Write = WriteRqst.Write;
-			break;
-		}
-		curridx = NextBufIdx(curridx);
+			/* signal that we need to wakeup walsenders later */
+			WalSndWakeupRequest();
 
-		/* If flexible, break out of loop as soon as we wrote something */
-		if (flexible && npages == 0)
-			break;
-	}
+			if (XLogArchivingActive())
+				XLogArchiveNotifySeg(segno);
 
-	Assert(npages == 0);
+			XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+			XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
-	/*
-	 * If asked to flush, do so
-	 */
-	if (LogwrtResult.Flush < WriteRqst.Flush &&
-		LogwrtResult.Flush < LogwrtResult.Write)
-
-	{
-		/*
-		 * Could get here without iterating above loop, in which case we might
-		 * have no open file or the wrong one.  However, we do not need to
-		 * fsync more than one file.
-		 */
-		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
-		{
-			if (openLogFile >= 0 &&
-				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
-								 wal_segment_size))
-				XLogFileClose();
-			if (openLogFile < 0)
+			/*
+			 * Request a checkpoint if we've consumed too much xlog since
+			 * the last one.  For speed, we first check using the local
+			 * copy of RedoRecPtr, which might be out of date; if it looks
+			 * like a checkpoint is needed, forcibly update RedoRecPtr and
+			 * recheck.
+			 */
+			if (IsUnderPostmaster && XLogCheckpointNeeded(segno))
 			{
-				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
-				ReserveExternalFD();
+				(void) GetRedoRecPtr();
+				if (XLogCheckpointNeeded(segno))
+					RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 			}
-
-			issue_xlog_fsync(openLogFile, openLogSegNo);
 		}
 
-		/* signal that we need to wakeup walsenders later */
-		WalSndWakeupRequest();
-
-		LogwrtResult.Flush = LogwrtResult.Write;
+		++segno;
 	}
 
+	/* signal that we need to wakeup walsenders later */
+	WalSndWakeupRequest();
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -3044,6 +2659,16 @@ XLogBackgroundFlush(void)
 				XLogFileClose();
 			}
 		}
+		else if (mappedPages != NULL)
+		{
+			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
+								 wal_segment_size))
+			{
+				XLogFileUnmap(mappedPages, openLogSegNo);
+				mappedPages = NULL;
+				openLogSegNo = 0;
+			}
+		}
 		return false;
 	}
 
@@ -3110,12 +2735,6 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests();
 
-	/*
-	 * Great, done. To take some work off the critical path, try to initialize
-	 * as many of the no-longer-needed WAL buffers for future use as we can.
-	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
-
 	/*
 	 * If we determined that we need to write data, but somebody else
 	 * wrote/flushed already, it should be considered as being active, to
@@ -3269,9 +2888,26 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-	save_errno = 0;
-	if (wal_init_zero)
+
+	/*
+	 * Allocate the file with posix_fallocate(3) to make use of hugepages and
+	 * reduce page-fault overhead.  Note that posix_fallocate(3) does not set
+	 * errno on error.  Instead, it returns an error number directly.
+	 */
+	save_errno = posix_fallocate(fd, 0, wal_segment_size);
+
+	if (save_errno)
 	{
+		/*
+		 * Do nothing on error.  Go to pgstat_report_wait_end().
+		 */
+	}
+	else if (wal_init_zero)
+	{
+		XLogCtlInsert  *Insert = &XLogCtl->Insert;
+		XLogPageHeader	NewPage = (XLogPageHeader) zbuffer.data;
+		XLogRecPtr		NewPageBeginPtr = logsegno * wal_segment_size;
+
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
 		 * ensure that all the file space has really been allocated.  On
@@ -3283,6 +2919,48 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 */
 		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 		{
+			memset(NewPage, 0, SizeOfXLogLongPHD);
+
+			/*
+			 * Fill the new page's header
+			 */
+			NewPage->xlp_magic = XLOG_PAGE_MAGIC;
+
+			/* NewPage->xlp_info = 0; */	/* done by memset */
+			NewPage->xlp_tli = ThisTimeLineID;
+			NewPage->xlp_pageaddr = NewPageBeginPtr;
+
+			/* NewPage->xlp_rem_len = 0; */	/* done by memset */
+
+			/*
+			 * If online backup is not in progress, mark the header to indicate
+			 * that WAL records beginning in this page have removable backup
+			 * blocks.  This allows the WAL archiver to know whether it is safe to
+			 * compress archived WAL data by transforming full-block records into
+			 * the non-full-block format.  It is sufficient to record this at the
+			 * page level because we force a page switch (in fact a segment
+			 * switch) when starting a backup, so the flag will be off before any
+			 * records can be written during the backup.  At the end of a backup,
+			 * the last page will be marked as all unsafe when perhaps only part
+			 * is unsafe, but at worst the archiver would miss the opportunity to
+			 * compress a few records.
+			 */
+			if (!Insert->forcePageWrites)
+				NewPage->xlp_info |= XLP_BKP_REMOVABLE;
+
+			/*
+			 * If first page of an XLOG segment file, make it a long header.
+			 */
+			if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+			{
+				XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
+
+				NewLongPage->xlp_sysid = ControlFile->system_identifier;
+				NewLongPage->xlp_seg_size = wal_segment_size;
+				NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
+				NewPage->xlp_info |= XLP_LONG_HEADER;
+			}
+
 			errno = 0;
 			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 			{
@@ -3290,6 +2968,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 				save_errno = errno ? errno : ENOSPC;
 				break;
 			}
+
+			NewPageBeginPtr += XLOG_BLCKSZ;
 		}
 	}
 	else
@@ -3605,6 +3285,138 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	return true;
 }
 
+/*
+ * Get a hint address for hugepage boundary mapping.
+ *
+ * Returns non-NULL on success, or PANICs otherwise.
+ */
+static void *
+XLogFileMapHint(void)
+{
+	void	   *hint;
+	Size		len;
+
+	len = (Size) wal_segment_size + PG_HUGEPAGE_MASK + 1;
+	hint = mmap(NULL, len, PROT_READ | PROT_WRITE,
+				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+	if (hint == MAP_FAILED)
+		elog(PANIC, "could not get hint address");
+
+	if (munmap(hint, len) != 0)
+		elog(PANIC, "could not unmap hint address");
+
+	/* Go forward onto the nearest hugepage boundary */
+	return (void *) (((uintptr_t) hint + PG_HUGEPAGE_MASK) & ~PG_HUGEPAGE_MASK);
+}
+
+static void *
+XLogFileMapUtil(void *hint, int fd, bool dax)
+{
+	int			flags;
+
+	if (dax)
+		flags = MAP_SHARED_VALIDATE | MAP_SYNC;
+	else
+		flags = MAP_SHARED;
+
+	return mmap(hint, wal_segment_size, PROT_READ | PROT_WRITE, flags, fd, 0);
+}
+
+/*
+ * Memory-map a pre-existing logfile segment for WAL buffers.
+ *
+ * On success, it returns non-NULL and sets *is_pmem to indicate whether the
+ * file is on PMEM.  Otherwise, it PANICs.
+ */
+static char *
+XLogFileMap(XLogSegNo segno, bool *is_pmem)
+{
+	char		path[MAXPGPATH];
+	char	   *addr;
+	void	   *hint;
+	int			fd;
+	struct stat	stat_buf;
+
+	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
+
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path)));
+
+	if (fstat(fd, &stat_buf) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not fstat file \"%s\": %m", path)));
+
+	if (stat_buf.st_size != wal_segment_size)
+		elog(PANIC,
+			 "invalid logfile segment size; path \"%s\" actual %d expected %d",
+			 path, (int) stat_buf.st_size, wal_segment_size);
+
+	hint = XLogFileMapHint();
+
+	/*
+	 * Try DAX mapping first (dax=true).
+	 *
+	 * If not supported, then do regular mapping (dax=false).
+	 */
+	addr = XLogFileMapUtil(hint, fd, true);
+
+	if (addr != MAP_FAILED)
+	{
+		*is_pmem = true;
+	}
+	else if (errno == EOPNOTSUPP || errno == EINVAL)
+	{
+		addr = XLogFileMapUtil(hint, fd, false);
+
+		if (addr == MAP_FAILED)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not mmap file \"%s\": %m", path)));
+
+		*is_pmem = false;
+	}
+
+	/* Check if the logfile segment is mapped onto hugepage boundary */
+	if ((uintptr_t) addr & PG_HUGEPAGE_MASK)
+		elog(WARNING,
+			 "logfile segment is not mapped onto hugepage boundary; path \"%s\" actual %p expected %p",
+			 path, addr, hint);
+
+	/* We don't need the file descriptor anymore, so close it */
+	if (close(fd) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	return addr;
+}
+
+/*
+ * Unmap a given logfile segment for WAL buffer.
+ */
+static void
+XLogFileUnmap(char *pages, XLogSegNo segno)
+{
+	Assert(pages != NULL);
+
+	if (munmap(pages, wal_segment_size) != 0)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not unmap file \"%s\": %m", xlogfname)));
+	}
+}
+
 /*
  * Open a pre-existing logfile segment for writing.
  */
@@ -4988,12 +4800,6 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
-	/* xlblocks array */
-	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5069,10 +4875,6 @@ XLOGShmemInit(void)
 	 * needed here.
 	 */
 	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
-	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
-	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
-	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
-
 
 	/* WAL insertion locks. Ensure they're aligned to the full padded size */
 	allocptr += sizeof(WALInsertLockPadded) -
@@ -5089,15 +4891,6 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
-	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
-
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
 	 * in additional info.)
@@ -7550,40 +7343,12 @@ StartupXLOG(void)
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * We do not need the if-else block that once existed here: we use WAL
+	 * segment files as WAL buffers, so the last block is "already on the
+	 * buffers."
+	 *
+	 * XXX We assume there is no torn record.
 	 */
-	if (EndOfLog % XLOG_BLCKSZ != 0)
-	{
-		char	   *page;
-		int			len;
-		int			firstIdx;
-		XLogRecPtr	pageBeginPtr;
-
-		pageBeginPtr = EndOfLog - (EndOfLog % XLOG_BLCKSZ);
-		Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
-
-		firstIdx = XLogRecPtrToBufIdx(EndOfLog);
-
-		/* Copy the valid part of the last block, and zero the rest */
-		page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
-		len = EndOfLog % XLOG_BLCKSZ;
-		memcpy(page, xlogreader->readBuf, len);
-		memset(page + len, 0, XLOG_BLCKSZ - len);
-
-		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
-		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
-	}
-	else
-	{
-		/*
-		 * There is no partial block to copy. Just set InitializedUpTo, and
-		 * let the first attempt to insert a log record to initialize the next
-		 * buffer.
-		 */
-		XLogCtl->InitializedUpTo = EndOfLog;
-	}
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
-- 
2.17.1
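
The mapping strategy of XLogFileMapHint() and XLogFileMap() in the patch above can be tried outside the backend. The following is a minimal standalone sketch for Linux, not the patch's code. The names map_hint, map_segment_file, and HUGEPAGE_MASK are hypothetical. It assumes 2 MiB hugepages, reports failure by returning MAP_FAILED instead of PANICking, gives up on any errno other than EOPNOTSUPP or EINVAL, and attempts the DAX mapping only where the system headers define MAP_SYNC and MAP_SHARED_VALIDATE. The hint is advisory: another mapping may claim the address between munmap() and the later mmap(), so the result is not guaranteed to be aligned.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* for MAP_ANONYMOUS and MAP_SYNC on glibc */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Assumed 2 MiB hugepage size; plays the role of PG_HUGEPAGE_MASK */
#define HUGEPAGE_MASK ((uintptr_t) 0x1FFFFF)

/* Find a probably-free address range and round its start up to 2 MiB. */
static void *
map_hint(size_t len)
{
	size_t		span = len + HUGEPAGE_MASK + 1;
	void	   *p = mmap(NULL, span, PROT_NONE,
						 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED || munmap(p, span) != 0)
		return NULL;			/* no hint; let the kernel choose */
	return (void *) (((uintptr_t) p + HUGEPAGE_MASK) & ~HUGEPAGE_MASK);
}

/* Map an existing file of exactly len bytes, trying DAX (MAP_SYNC) first. */
static void *
map_segment_file(const char *path, size_t len, bool *is_pmem)
{
	struct stat st;
	void	   *hint;
	void	   *addr = MAP_FAILED;
	bool		try_plain = true;
	int			fd;
	int			saved_errno;

	assert(path != NULL && is_pmem != NULL);

	fd = open(path, O_RDWR);
	if (fd < 0)
		return MAP_FAILED;
	if (fstat(fd, &st) != 0 || (size_t) st.st_size != len)
	{
		close(fd);
		errno = EINVAL;			/* unreadable or unexpected file size */
		return MAP_FAILED;
	}

	hint = map_hint(len);
	*is_pmem = false;

#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
	addr = mmap(hint, len, PROT_READ | PROT_WRITE,
				MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (addr != MAP_FAILED)
		*is_pmem = true;
	else
		try_plain = (errno == EOPNOTSUPP || errno == EINVAL);
#endif

	/* Fall back to a regular shared mapping if DAX is not supported */
	if (addr == MAP_FAILED && try_plain)
		addr = mmap(hint, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

	saved_errno = errno;
	close(fd);					/* the mapping outlives the descriptor */
	errno = saved_errno;
	return addr;
}
```

A caller would check the result against MAP_FAILED, test ((uintptr_t) addr & HUGEPAGE_MASK) to see whether the hint was honored, and release the mapping with munmap(addr, len).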

v2-0003-Lazy-unmap-WAL-segments.patch (application/octet-stream)
From cf15df350201cd2c5383f04ea52b9ddc534c997a Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:02 +0900
Subject: [PATCH v2 3/5] Lazy-unmap WAL segments

---
 src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 423eb839b5..ff7d0b69bd 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -770,7 +770,9 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static XLogSegNo beingClosedLogSegNo = 0;
 static char *mappedPages = NULL;
+static char *beingUnmappedPages = NULL;
 static bool pmemMapped = 0;
 
 /* 2MiB hugepage mask used by XLogFileMapHint */
@@ -1179,6 +1181,14 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/* Lazy-unmap */
+	if (beingUnmappedPages != NULL)
+	{
+		XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+		beingUnmappedPages = NULL;
+		beingClosedLogSegNo = 0;
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -1812,9 +1822,23 @@ GetXLogBuffer(XLogRecPtr ptr)
 	XLByteToSeg(ptr, segno, wal_segment_size);
 	if (segno != openLogSegNo)
 	{
-		/* Unmap the current segment if mapped */
+		/*
+		 * We do not want to unmap the current segment here because we are in
+		 * a critical section and unmapping is a time-consuming operation.  So
+		 * we just mark it to be unmapped later.
+		 */
 		if (mappedPages != NULL)
-			XLogFileUnmap(mappedPages, openLogSegNo);
+		{
+			/*
+			 * If there is another being-unmapped segment, it cannot be helped;
+			 * we unmap it here.
+			 */
+			if (beingUnmappedPages != NULL)
+				XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+
+			beingUnmappedPages = mappedPages;
+			beingClosedLogSegNo = openLogSegNo;
+		}
 
 		/* Map the segment we need */
 		mappedPages = XLogFileMap(segno, &pmemMapped);
-- 
2.17.1
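
The lazy-unmap idea in the patch above can be illustrated in isolation: when switching segments on the hot path, park the old mapping in a single "pending" slot and release it later. The sketch below is a hypothetical standalone model, not the patch's code. The LazyMap struct and the lazy_switch and lazy_flush functions are invented names, and the unmaps counter exists only to make the behavior observable.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE				/* for MAP_ANONYMOUS on glibc */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct LazyMap
{
	void	   *current;		/* mapping in use, or NULL */
	void	   *pending;		/* mapping waiting to be unmapped, or NULL */
	size_t		len;			/* length of every mapping */
	int			unmaps;			/* number of munmap() calls so far */
} LazyMap;

static int
do_unmap(LazyMap *m, void *addr)
{
	m->unmaps++;
	return munmap(addr, m->len);
}

/*
 * Hot path: make "next" the current mapping.  The old mapping is only
 * parked.  If the slot is already occupied, that older mapping has to be
 * unmapped right here, as in the patch's GetXLogBuffer().
 */
static int
lazy_switch(LazyMap *m, void *next)
{
	int			rc = 0;

	assert(m != NULL && next != NULL);
	if (m->current != NULL)
	{
		if (m->pending != NULL)
			rc = do_unmap(m, m->pending);
		m->pending = m->current;
	}
	m->current = next;
	return rc;
}

/* Cold path: release the parked mapping, if any. */
static int
lazy_flush(LazyMap *m)
{
	int			rc = 0;

	assert(m != NULL);
	if (m->pending != NULL)
	{
		rc = do_unmap(m, m->pending);
		m->pending = NULL;
	}
	return rc;
}
```

In the patch, the role of lazy_flush() is played by the "Lazy-unmap" block added to XLogInsertRecord(), and the role of lazy_switch() by the segment switch in GetXLogBuffer().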

v2-0004-Speculative-map-WAL-segments.patch (application/octet-stream)
From 111d5892f076cc0504e9ec2866ac5297de1862df Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:03 +0900
Subject: [PATCH v2 4/5] Speculative-map WAL segments

---
 src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ff7d0b69bd..382256369d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -993,6 +993,8 @@ XLogInsertRecord(XLogRecData *rdata,
 							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
+	XLogRecPtr	ProbablyInsertPos;
+	XLogSegNo	ProbablyInsertSegNo;
 	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
@@ -1002,6 +1004,23 @@ XLogInsertRecord(XLogRecData *rdata,
 	if (!XLogInsertAllowed())
 		elog(ERROR, "cannot make new WAL entries during recovery");
 
+	/* Speculatively map a segment we probably need */
+	ProbablyInsertPos = GetInsertRecPtr();
+	XLByteToSeg(ProbablyInsertPos, ProbablyInsertSegNo, wal_segment_size);
+	if (ProbablyInsertSegNo != openLogSegNo)
+	{
+		if (mappedPages != NULL)
+		{
+			Assert(beingUnmappedPages == NULL);
+			Assert(beingClosedLogSegNo == 0);
+			beingUnmappedPages = mappedPages;
+			beingClosedLogSegNo = openLogSegNo;
+		}
+		mappedPages = XLogFileMap(ProbablyInsertSegNo, &pmemMapped);
+		Assert(mappedPages != NULL);
+		openLogSegNo = ProbablyInsertSegNo;
+	}
+
 	/*----------
 	 *
 	 * We have now done all the preparatory work we can without holding a
-- 
2.17.1
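
The check at the heart of the speculative mapping above is plain integer arithmetic: a WAL position belongs to segment number position / wal_segment_size, and a remap is needed only when that number differs from the currently open segment. The sketch below shows this with hypothetical stand-in names (RecPtr, SegNo, byte_to_seg, needs_remap) instead of the PostgreSQL types and macros. The mapping is speculative because the insert position can advance before the record is actually placed, so the segment check in GetXLogBuffer() is still required.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t RecPtr;		/* stand-in for XLogRecPtr */
typedef uint64_t SegNo;			/* stand-in for XLogSegNo */

/* Segment number that contains the given WAL position. */
static SegNo
byte_to_seg(RecPtr ptr, uint64_t seg_size)
{
	assert(seg_size > 0);
	return ptr / seg_size;
}

/* True if the probable insert position lies outside the open segment. */
static bool
needs_remap(RecPtr insert_pos, SegNo open_segno, uint64_t seg_size)
{
	return byte_to_seg(insert_pos, seg_size) != open_segno;
}
```

With 16 MiB segments, for example, every position from 16 MiB up to just below 32 MiB maps to segment 1, so needs_remap() stays false while segment 1 is open.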

v2-0005-Map-WAL-segments-with-MAP_POPULATE-if-non-DAX.patch (application/octet-stream)
From a3ba57b33ac23f8db46e7f92e72a558db6ccd64a Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:04 +0900
Subject: [PATCH v2 5/5] Map WAL segments with MAP_POPULATE if non-DAX

---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 382256369d..5c387846e5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3361,7 +3361,7 @@ XLogFileMapUtil(void *hint, int fd, bool dax)
 	if (dax)
 		flags = MAP_SHARED_VALIDATE | MAP_SYNC;
 	else
-		flags = MAP_SHARED;
+		flags = MAP_SHARED | MAP_POPULATE;
 
 	return mmap(hint, wal_segment_size, PROT_READ | PROT_WRITE, flags, fd, 0);
 }
-- 
2.17.1

msync-performance-s50.png (image/png)
msync-performance-s1000.png (image/png)
postgresql.conf (application/octet-stream)
#19Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#1)
5 attachment(s)
RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3. It can now be used in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
walreceiver now stores received records directly in the non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
pg_basebackup now copies received WAL segments onto the non-volatile WAL buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center


-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote'
<amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.

I also measured performance before and after applying the patchset, varying the -c/--client and -j/--jobs
options of pgbench for each scaling factor (s = 50 and s = 1000). The results are presented in the following
tables and the attached charts. Conditions, steps, and other details are given below.

Results (s=50)
==============
         Throughput [10^3 TPS]    Average latency [ms]
( c, j)  before  after            before  after
-------  ---------------------    ---------------------
( 8, 8)  35.7    37.1 (+3.9%)     0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)     0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)     0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)     0.715   0.660 (-7.7%)

Results (s=1000)
================
         Throughput [10^3 TPS]    Average latency [ms]
( c, j)  before  after            before  after
-------  ---------------------    ---------------------
( 8, 8)  37.4    40.1 (+7.3%)     0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)     0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)     0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)     0.622   0.569 (-8.5%)

Both throughput and average latency improved for each scaling factor. Throughput seems to have almost reached
its upper limit at (c,j)=(36,18).

The improvement in the s=1000 case looks larger than in the s=50 case. I think a larger scaling factor leads to
less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation,
write-ahead logging appears to be more significant for performance.
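
The percentages in the tables can be recomputed from the rounded "before" and "after" values. The sketch below uses two hypothetical helpers, rel_change_pct and latency_ms, which are not part of the patchset. The second one rests on an observation about the reported numbers rather than on anything stated in the mail: each average latency in milliseconds is close to c divided by the throughput in 10^3 TPS. Results can differ from the tables in the last digit because the inputs are already rounded.

```c
#include <assert.h>

/* Relative change from "before" to "after", in percent. */
static double
rel_change_pct(double before, double after)
{
	assert(before != 0.0);
	return (after - before) / before * 100.0;
}

/*
 * Average latency in milliseconds implied by "clients" concurrent clients
 * and a throughput given in units of 10^3 transactions per second.
 */
static double
latency_ms(int clients, double kilo_tps)
{
	assert(kilo_tps > 0.0);
	return (double) clients / kilo_tps;
}
```

For the s=50, (c,j)=(8,8) row, rel_change_pct(35.7, 37.1) is about +3.9 and latency_ms(8, 35.7) is about 0.224, which agrees with the table.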

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
  - Pin postgres (server processes) to node 0 and pgbench to node 1
  - 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
  - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
  - Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I ran the following steps three times and took the median of the three runs as the final
result shown in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so the built-in "TPC-B (sort-of)" script was used.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>;
'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly; it is, so to speak, "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches themselves, but
let me point out that it's better to base these patches on HEAD
(master branch) than REL_12_0, because all new code is committed to
the master branch, whereas stable branches such as REL_12_0 only
receive bug fixes. Do you have any specific reason to be working on
REL_12_0?

Yes, because I think it's human-friendly to reproduce and discuss
performance measurement. Of course I know all new accepted patches are
merged into master's HEAD, not stable branches and not even release
tags, so I'm aware that I will have to rebase my patchset onto master
sooner or later. However, if someone, including me, says that s/he
applies my patchset to "master" and measures its performance, we have
to pay attention to which commit the "master" really points to.
Although we have sha1 hashes to specify which commit, we should check
whether the specific commit on master has patches affecting
performance or not, because master's HEAD gets new patches day by day.
On the other hand, a release tag clearly points to a commit we all
probably know. Also, we can check the features and improvements more
easily by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to
notice the impact of any relevant developments in the master branch,
even developments which possibly require rethinking the architecture
of your own changes, although maybe that rarely occurs.

Thanks,
Amit

Attachments:

v3-0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From 931ab8fa7e9181f6b69601ad279e0ee5acb103d4 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:56 +0900
Subject: [PATCH v3 1/5] Support GUCs for external WAL buffer

To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size.  Now postgres maps a file at that path onto memory to
use it as WAL buffer.  Note that the buffer is still volatile for now.
---
 configure                                     | 262 ++++++++++++++++++
 configure.in                                  |  43 +++
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/nv_xlog_buffer.c   |  95 +++++++
 src/backend/access/transam/xlog.c             | 164 ++++++++++-
 src/backend/utils/misc/guc.c                  |  23 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/bin/initdb/initdb.c                       |  93 ++++++-
 src/include/access/nv_xlog_buffer.h           |  71 +++++
 src/include/access/xlog.h                     |   2 +
 src/include/pg_config.h.in                    |   6 +
 src/include/utils/guc.h                       |   4 +
 12 files changed, 747 insertions(+), 21 deletions(-)
 create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
 create mode 100644 src/include/access/nv_xlog_buffer.h

diff --git a/configure b/configure
index 2feff37fe3..3f16feeb54 100755
--- a/configure
+++ b/configure
@@ -866,6 +866,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 '
@@ -1570,6 +1571,7 @@ Optional Packages:
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
+  --with-nvwal            use non-volatile WAL buffer (NVWAL)
   --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
 
 Some influential environment variables:
@@ -8504,6 +8506,203 @@ fi
 
 
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -z "$GREP"; then
+  ac_path_GREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in grep ggrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+  # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'GREP' >> "conftest.nl"
+    "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_GREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_GREP="$ac_path_GREP"
+      ac_path_GREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_GREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_GREP"; then
+    as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+   then ac_cv_path_EGREP="$GREP -E"
+   else
+     if test -z "$EGREP"; then
+  ac_path_EGREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in egrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+  # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'EGREP' >> "conftest.nl"
+    "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_EGREP="$ac_path_EGREP"
+      ac_path_EGREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_EGREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_EGREP"; then
+    as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_EGREP=$EGREP
+fi
+
+   fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#if __ELF__
+  yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "yes" >/dev/null 2>&1; then :
+  ELF_SYS=true
+else
+  if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
 #
 # Assignments
 #
@@ -12861,6 +13060,57 @@ fi
 fi
 
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
 
 ##
 ## Header files
@@ -13540,6 +13790,18 @@ fi
 
 done
 
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$PORTNAME" = "win32" ; then
diff --git a/configure.in b/configure.in
index 0188c6ff07..a5f9c9fb9d 100644
--- a/configure.in
+++ b/configure.in
@@ -992,6 +992,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
               [do not use Zlib])
 AC_SUBST(with_zlib)
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+              [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+  yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
 #
 # Assignments
 #
@@ -1293,6 +1325,12 @@ elif test "$with_uuid" = ossp ; then
 fi
 AC_SUBST(UUID_LIBS)
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [],
+               [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 
 ##
 ## Header files
@@ -1470,6 +1508,11 @@ elif test "$with_uuid" = ossp ; then
       [AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
 fi
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 if test "$PORTNAME" = "win32" ; then
    AC_CHECK_HEADERS(crtdefs.h)
 fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
 	xlogfuncs.o \
 	xloginsert.o \
 	xlogreader.o \
-	xlogutils.o
+	xlogutils.o \
+	nv_xlog_buffer.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ *		PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns the mapped address on success; PANICs and never returns otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
+
+	Assert(fname != NULL);
+	Assert(fsize > 0);
+
+	if (IsBootstrapProcessingMode())
+	{
+		/*
+		 * Create and map a new file if we are in bootstrap mode (typically
+		 * executed by initdb).
+		 */
+		addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+							 pg_file_create_mode, &map_len, &is_pmem);
+	}
+	else
+	{
+		/*
+		 * Map an existing file.  The second argument (len) should be zero,
+		 * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+		 * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+		 */
+		addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+	}
+
+	if (addr == NULL)
+		elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+	if (map_len != fsize)
+		elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+					"expected %zu; actual %zu",
+			 fname, fsize, map_len);
+
+	if (!is_pmem)
+		elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+			 fname);
+
+	/*
+	 * Assert page boundary alignment (8KiB as default).  It should pass because
+	 * PMDK considers hugepage boundary alignment (2MiB or 1GiB on x64).
+	 */
+	Assert((uintptr_t) addr % XLOG_BLCKSZ == 0);
+
+	elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+		 fname, addr, (char *) addr + map_len);
+	return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	Assert(addr != NULL);
+
+	if (pmem_unmap(addr, fsize) < 0)
+	{
+		elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+		return;
+	}
+
+	elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a1256a103b..0681ba1262 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -37,6 +37,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -873,6 +874,12 @@ static bool InRedo = false;
 /* Have we launched bgwriter during recovery? */
 static bool bgwriterLaunched = false;
 
+/* For non-volatile WAL buffer (NVWAL) */
+char	   *NvwalPath = NULL;	/* a GUC parameter */
+int			NvwalSizeMB = 1024;	/* a direct GUC parameter */
+static Size	NvwalSize = 0;		/* an indirect GUC parameter */
+static bool	NvwalAvail = false;
+
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -5014,6 +5021,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+	Assert(!NvwalAvail);
+
+	if (**newval != '\0')
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path cannot be set in a build without NVWAL support");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+	/* true if not empty; false if empty */
+	NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the boundary only and DOES NOT check if the size is a multiple
+ * of wal_segment_size because the segment size (probably stored in the
+ * control file) has not been set properly here yet.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+	Size		buf_size;
+	int64		npages;
+
+	Assert(*newval > 0);
+
+	buf_size = (Size) (*newval) * 1024 * 1024;
+	npages = (int64) buf_size / XLOG_BLCKSZ;
+	Assert(npages > 0);
+
+	if (npages > INT_MAX)
+	{
+		/* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages too large; "
+						 "buf_size %zu; XLOG_BLCKSZ %d",
+						 *newval, buf_size, (int) XLOG_BLCKSZ);
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+	NvwalSize = (Size) newval * 1024 * 1024;
+}
+
 /*
  * Read the control file, set respective GUCs.
  *
@@ -5042,13 +5119,49 @@ XLOGShmemSize(void)
 {
 	Size		size;
 
+	/*
+	 * If we use non-volatile WAL buffer, we ignore the given wal_buffers.
+	 * Instead, we set it to a value based on the size of the file for the
+	 * buffer. This must be done here because the xlblocks array is sized from it.
+	 */
+	if (NvwalAvail)
+	{
+		char		buf[32];
+		int64		npages;
+
+		Assert(NvwalSizeMB > 0);
+		Assert(NvwalSize > 0);
+		Assert(wal_segment_size > 0);
+		Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+		/*
+		 * At last, we can check if the size of non-volatile WAL buffer
+		 * (nvwal_size) is a multiple of WAL segment size.
+		 *
+		 * Note that NvwalSize has already been calculated in assign_nvwal_size.
+		 */
+		if (NvwalSize % wal_segment_size != 0)
+		{
+			elog(PANIC,
+				 "invalid value for nvwal_size (%dMB): "
+				 "it should be a multiple of WAL segment size; "
+				 "NvwalSize %zu; wal_segment_size %d",
+				 NvwalSizeMB, NvwalSize, wal_segment_size);
+		}
+
+		npages = (int64) NvwalSize / XLOG_BLCKSZ;
+		Assert(npages > 0 && npages <= INT_MAX);
+
+		snprintf(buf, sizeof(buf), "%d", (int) npages);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+	}
 	/*
 	 * If the value of wal_buffers is -1, use the preferred auto-tune value.
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
 	 */
-	if (XLOGbuffers == -1)
+	else if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
@@ -5064,10 +5177,13 @@ XLOGShmemSize(void)
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	if (!NvwalAvail)
+	{
+		/* extra alignment padding for XLOG I/O buffers */
+		size = add_size(size, XLOG_BLCKSZ);
+		/* and the buffers themselves */
+		size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	}
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5161,13 +5277,32 @@ XLOGShmemInit(void)
 	}
 
 	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * Open and memory-map a file for the non-volatile XLOG buffer. PMDK will
+	 * align the start of the buffer to a 2-MiB boundary if the size of the
+	 * buffer is at least 4 MiB.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	if (NvwalAvail)
+	{
+		/* Logging and error-handling should be done in the function */
+		XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+		/*
+		 * Do not memset the non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it may contain records needed for recovery. We should do so
+		 * in a checkpoint after the recovery completes successfully.
+		 */
+	}
+	else
+	{
+		/*
+		 * Align the start of the page buffers to a full xlog block size
+		 * boundary. This simplifies some calculations in XLOG insertion. It
+		 * is also required for O_DIRECT.
+		 */
+		allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+		XLogCtl->pages = allocptr;
+		memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	}
 
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8522,6 +8657,13 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+
+	/*
+	 * If we use non-volatile XLOG buffer, unmap it.
+	 */
+	if (NvwalAvail)
+		UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 75fc6f11d6..140a99faee 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2707,7 +2707,7 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-		-1, -1, (INT_MAX / XLOG_BLCKSZ),
+		-1, -1, INT_MAX,
 		check_wal_buffers, NULL, NULL
 	},
 
@@ -3381,6 +3381,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, assign_tcp_user_timeout, show_tcp_user_timeout
 	},
 
+	{
+		{"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+			NULL,
+			GUC_UNIT_MB
+		},
+		&NvwalSizeMB,
+		1024, 1, INT_MAX,
+		check_nvwal_size, assign_nvwal_size, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4419,6 +4430,16 @@ static struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+			NULL
+		},
+		&NvwalPath,
+		"",
+		check_nvwal_path, assign_nvwal_path, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 3a25287a39..866f77828d 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -226,6 +226,8 @@
 #checkpoint_timeout = 5min		# range 30s-1d
 #max_wal_size = 1GB
 #min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 786672b1b6..1b18097580 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -144,7 +144,10 @@ static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
 static int	wal_segment_size_mb;
+static int	nvwal_size_mb;
 
 
 /* internal vars */
@@ -1109,14 +1112,78 @@ setup_config(void)
 	conflines = replace_token(conflines, "#port = 5432", repltok);
 #endif
 
-	/* set default max_wal_size and min_wal_size */
-	snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
-	conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+	if (nvwal_path != NULL)
+	{
+		int nr_segs;
+
+		if (str_nvwal_size_mb == NULL)
+			nvwal_size_mb = 1024;
+		else
+		{
+			char *endptr;
+
+			/* check that the argument is a number */
+			nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+			/* verify that the size of non-volatile WAL buffer is valid */
+			if (endptr == str_nvwal_size_mb || *endptr != '\0')
+			{
+				pg_log_error("argument of --nvwal-size must be a number; "
+							 "str_nvwal_size_mb '%s'",
+							 str_nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb <= 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a positive number; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb % wal_segment_size_mb != 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a multiple of WAL segment size; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+				exit(1);
+			}
+		}
+
+		/*
+		 * XXX We set {min_,max_,nv}wal_size to the same value.  Note that
+		 * postgres might bootstrap and run even if the three settings do not
+		 * have the same value, but that has not been tested yet.
+		 */
+		nr_segs = nvwal_size_mb / wal_segment_size_mb;
 
-	snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
-	conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+				 nvwal_path);
+		conflines = replace_token(conflines,
+								  "#nvwal_path = '/path/to/nvwal'", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+	}
+	else
+	{
+		/* set default max_wal_size and min_wal_size */
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	}
 
 	snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
 			 escape_quotes(lc_messages));
@@ -2321,6 +2388,8 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("  -P, --nvwal-path=FILE     path to file for non-volatile WAL buffer (NVWAL)\n"));
+	printf(_("  -Q, --nvwal-size=SIZE     size of NVWAL, in megabytes\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("  -k, --data-checksums      use data page checksums\n"));
@@ -2989,6 +3058,8 @@ main(int argc, char *argv[])
 		{"sync-only", no_argument, NULL, 'S'},
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
+		{"nvwal-path", required_argument, NULL, 'P'},
+		{"nvwal-size", required_argument, NULL, 'Q'},
 		{"data-checksums", no_argument, NULL, 'k'},
 		{"allow-group-access", no_argument, NULL, 'g'},
 		{NULL, 0, NULL, 0}
@@ -3032,7 +3103,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
 	{
 		switch (c)
 		{
@@ -3126,6 +3197,12 @@ main(int argc, char *argv[])
 			case 12:
 				str_wal_segment_size_mb = pg_strdup(optarg);
 				break;
+			case 'P':
+				nvwal_path = pg_strdup(optarg);
+				break;
+			case 'Q':
+				str_nvwal_size_mb = pg_strdup(optarg);
+				break;
 			case 'g':
 				SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
 				break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void	UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#define nv_memset_persist	pmem_memset_persist
+#define nv_memcpy_nodrain	pmem_memcpy_nodrain
+#define nv_flush			pmem_flush
+#define nv_drain			pmem_drain
+#define nv_persist			pmem_persist
+
+#else
+/* Stubs for builds without NVWAL; static inline avoids duplicate symbols. */
+static inline void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+static inline void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+	return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+				  size_t len)
+{
+	return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+	return;
+}
+
+static inline void
+nv_drain(void)
+{
+	return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+	return;
+}
+
+#endif							/* USE_NVWAL */
+#endif							/* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 347a38f57c..0a05e79524 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,8 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern char *NvwalPath;
+extern int  NvwalSizeMB;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index c199cd46d2..90d23b46d1 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
 /* Define to 1 if you have the `pam' library (-lpam). */
 #undef HAVE_LIBPAM
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define if you have a function readline library */
 #undef HAVE_LIBREADLINE
 
@@ -880,6 +883,9 @@
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
 /* Define to build with OpenSSL support. (--with-openssl) */
 #undef USE_OPENSSL
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..d941a76d43 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,6 +438,10 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.17.1
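
The size checks that the patch above splits between check_nvwal_size, XLOGShmemSize, and initdb can be summarized in one place. The following is only a minimal standalone sketch, not code from the patch: the function name nvwal_pages_for_size and the constant SKETCH_XLOG_BLCKSZ are hypothetical stand-ins (the latter assumes the default 8 KiB XLOG_BLCKSZ), and the sketch returns -1 where the patch would report a GUC error or PANIC.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for XLOG_BLCKSZ; assumes the default of 8 KiB. */
#define SKETCH_XLOG_BLCKSZ 8192

/*
 * Hypothetical helper: returns the number of WAL pages an NVWAL file of
 * nvwal_size_mb megabytes would hold, or -1 if the size is rejected.
 *
 * A size is rejected if it is not positive, if it is not a multiple of the
 * WAL segment size, or if the page count would not fit in an int (the
 * patch sets wal_buffers, an int GUC, to this page count).
 */
int
nvwal_pages_for_size(int nvwal_size_mb, int wal_segment_size_bytes)
{
	uint64_t	buf_size;
	uint64_t	npages;

	if (nvwal_size_mb <= 0 || wal_segment_size_bytes <= 0)
		return -1;

	/* Segments are expected to consist of whole WAL pages. */
	assert(wal_segment_size_bytes % SKETCH_XLOG_BLCKSZ == 0);

	/* Widen before multiplying so large MB values cannot overflow. */
	buf_size = (uint64_t) nvwal_size_mb * 1024 * 1024;

	if (buf_size % (uint64_t) wal_segment_size_bytes != 0)
		return -1;

	npages = buf_size / SKETCH_XLOG_BLCKSZ;
	if (npages == 0 || npages > INT_MAX)
		return -1;

	return (int) npages;
}
```

With the default nvwal_size of 1024 MB and 16 MB segments, the sketch yields the page count that the patch would assign to wal_buffers. A size such as 24 MB is rejected because it is not a whole number of 16 MB segments.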

Attachment: v3-0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 0cb1f9197350d76ad8ef1fc2115afb7abdfc4fdc Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:57 +0900
Subject: [PATCH v3 2/5] Non-volatile WAL buffer

Now external WAL buffer becomes non-volatile.

Bumps PG_CONTROL_VERSION.
---
 src/backend/access/transam/xlog.c            | 1154 ++++++++++++++++--
 src/backend/access/transam/xlogreader.c      |   24 +
 src/bin/pg_controldata/pg_controldata.c      |    3 +
 src/include/access/xlog.h                    |    8 +
 src/include/catalog/pg_control.h             |   17 +-
 src/test/regress/expected/misc_functions.out |   14 +-
 src/test/regress/sql/misc_functions.sql      |   14 +-
 7 files changed, 1097 insertions(+), 137 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0681ba1262..45e05b9498 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -654,6 +654,13 @@ typedef struct XLogCtlData
 	TimeLineID	ThisTimeLineID;
 	TimeLineID	PrevTimeLineID;
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * All the records up to this LSN are persistent in NVWAL.
+	 */
+	XLogRecPtr	persistentUpTo;
+
 	/*
 	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
@@ -783,11 +790,13 @@ typedef enum
 	XLOG_FROM_ANY = 0,			/* request to read WAL from any source */
 	XLOG_FROM_ARCHIVE,			/* restored using restore_command */
 	XLOG_FROM_PG_WAL,			/* existing file in pg_wal */
-	XLOG_FROM_STREAM			/* streamed from master */
+	XLOG_FROM_NVWAL,			/* non-volatile WAL buffer */
+	XLOG_FROM_STREAM,			/* streamed from master via segment file */
+	XLOG_FROM_STREAM_NVWAL		/* same as above, but via NVWAL */
 } XLogSource;
 
 /* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream", "stream_nvwal"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
@@ -922,6 +931,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1204,6 +1214,43 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/*
+	 * Request a checkpoint here if non-volatile WAL buffer is used and we
+	 * have consumed too much WAL since the last checkpoint.
+	 *
+	 * We first screen under the condition (1) OR (2) below:
+	 *
+	 * (1) The record was the first one in a certain segment.
+	 * (2) The record was inserted across segments.
+	 *
+	 * We then check the segment number which the record was inserted into.
+	 */
+	if (NvwalAvail && inserted &&
+		(StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+		 StartPos / wal_segment_size < EndPos / wal_segment_size))
+	{
+		XLogSegNo	end_segno;
+
+		XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+		/*
+		 * NOTE: We do not signal walsender here because the inserted record
+		 * has not been drained to the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal the archiver here because the inserted record
+		 * has not been flushed to a segment file.  So we don't need to update
+		 * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+		 */
+
+		/* Two-step checking for speed (see also XLogWrite) */
+		if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+		{
+			(void) GetRedoRecPtr();
+			if (XLogCheckpointNeeded(end_segno))
+				RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+		}
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -2136,6 +2183,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
 	int			npages = 0;
+	bool		is_firstpage;
+
+	if (NvwalAvail)
+		elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo,
+			 (uint32) (upto >> 32),
+			 (uint32) upto,
+			 opportunistic ? "true" : "false");
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -2197,7 +2253,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				{
 					/* Have to write it ourselves */
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
+
+					if (NvwalAvail)
+					{
+						/*
+						 * If we use non-volatile WAL buffer, writing the
+						 * buffer pages out to segment files is a special but
+						 * expected case, and for simplicity, it is done
+						 * segment by segment.
+						 */
+						XLogRecPtr		OldSegEndPtr;
+
+						OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+						Assert(OldSegEndPtr % wal_segment_size == 0);
+
+						WriteRqst.Write = OldSegEndPtr;
+					}
+					else
+						WriteRqst.Write = OldPageRqstPtr;
+
 					WriteRqst.Flush = 0;
 					XLogWrite(WriteRqst, false);
 					LWLockRelease(WALWriteLock);
@@ -2224,7 +2298,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
 		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+		if (NvwalAvail)
+		{
+			/*
+			 * We do not combine MemSet() with pmem_persist() because
+			 * pmem_persist() may use a slow, strongly ordered cache-flush
+			 * instruction if a fast, weakly ordered one is not supported.
+			 * Instead, we first zero-fill the buffer with
+			 * pmem_memset_persist(), which can leverage fast non-temporal
+			 * store instructions, then make the header persistent later.
+			 */
+			nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+		}
+		else
+			MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
 
 		/*
 		 * Fill the new page's header
@@ -2256,7 +2343,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+		is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+		if (is_firstpage)
 		{
 			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
 
@@ -2271,7 +2359,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
 		 * holding a lock.
 		 */
-		pg_write_barrier();
+		if (NvwalAvail)
+		{
+			/* Make the header persistent on PMEM */
+			nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+		}
+		else
+			pg_write_barrier();
 
 		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
 
@@ -2281,6 +2375,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	}
 	LWLockRelease(WALBufMappingLock);
 
+	if (NvwalAvail)
+		elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+			 (uint32) (ControlFile->discardedUpTo >> 32),
+			 (uint32) ControlFile->discardedUpTo,
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo);
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
 	{
@@ -2662,6 +2763,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
 
+	/*
+	 * Update discardedUpTo if NVWAL is used.  A new value should not fall
+	 * behind the old one.
+	 */
+	if (NvwalAvail)
+	{
+		Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		if (ControlFile->discardedUpTo < LogwrtResult.Write)
+		{
+			ControlFile->discardedUpTo = LogwrtResult.Write;
+			UpdateControlFile();
+		}
+		LWLockRelease(ControlFileLock);
+	}
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -2866,6 +2984,123 @@ XLogFlush(XLogRecPtr record)
 		return;
 	}
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	FromPos;
+
+		/*
+		 * No page on the NVWAL is to be flushed to segment files.  Instead,
+		 * we wait for all the insertions preceding this one to complete.  We
+		 * will wait for all the records to become persistent on the NVWAL below.
+		 */
+		record = WaitXLogInsertionsToFinish(record);
+
+		/*
+		 * Check if another backend has already done what we are about to do.
+		 *
+		 * We can compare something <= XLogCtl->persistentUpTo without
+		 * holding XLogCtl->info_lck spinlock because persistentUpTo is
+		 * monotonically increasing and can be loaded atomically on each
+		 * NVWAL-supported platform (now x64 only).
+		 */
+		FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+		if (record <= FromPos)
+			return;
+
+		/*
+		 * In a very rare case, we have wrapped around the whole NVWAL.  We do
+		 * not need to care about old pages here because they have already
+		 * been evicted to segment files at record insertion.
+		 *
+		 * In such a case, we flush the whole NVWAL.  We also log it as a
+		 * warning because it can be a time-consuming operation.
+		 *
+		 * TODO Advance XLogCtl->persistentUpTo at the end of XLogWrite, and
+		 * we can remove the following first if-block.
+		 */
+		if (record - FromPos > NvwalSize)
+		{
+			elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+				 (uint32) (FromPos >> 32), (uint32) FromPos,
+				 (uint32) (record >> 32), (uint32) record);
+
+			nv_flush(XLogCtl->pages, NvwalSize);
+		}
+		else
+		{
+			char   *frompos;
+			char   *uptopos;
+			size_t	fromoff;
+			size_t	uptooff;
+
+			/*
+			 * Flush each record that is probably not flushed yet.
+			 *
+			 * We say "probably" for two reasons.  The first is that a record
+			 * copied with non-temporal store instructions has already been
+			 * "flushed", but we cannot tell such records apart.  An extra
+			 * nv_flush on them does not harm consistency.
+			 *
+			 * The second reason is that the target record might have already
+			 * been evicted to a segment file by now.  In this case too,
+			 * nv_flush does not harm consistency.
+			 */
+			uptooff = record % NvwalSize;
+			uptopos = XLogCtl->pages + uptooff;
+			fromoff = FromPos % NvwalSize;
+			frompos = XLogCtl->pages + fromoff;
+
+			/* Handles rotation */
+			if (uptopos <= frompos)
+			{
+				nv_flush(frompos, NvwalSize - fromoff);
+				fromoff = 0;
+				frompos = XLogCtl->pages;
+			}
+
+			nv_flush(frompos, uptooff - fromoff);
+		}
+
+		/*
+		 * To guarantee durability ("D" of ACID), we should satisfy the
+		 * following two for each transaction X:
+		 *
+		 *  (1) All the WAL records inserted by X, including the commit record
+		 *      of X, should persist on NVWAL before the server commits X.
+		 *
+		 *  (2) All the WAL records inserted by transactions other than X
+		 *      that have a smaller LSN than the commit record just inserted
+		 *      by X should persist on NVWAL before the server commits X.
+		 *
+		 * Condition (1) can be satisfied by a store barrier after the commit
+		 * record of X is flushed because each WAL record of X is already
+		 * flushed at the end of its insertion.  Condition (2) can be satisfied
+		 * by waiting for all record insertions with a smaller LSN than that
+		 * commit record, and by a store barrier as well.
+		 *
+		 * Now is the time.  Have a store barrier.
+		 */
+		nv_drain();
+
+		/*
+		 * Remember where the last persistent record is.  A new value should
+		 * not fall behind the old one.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->persistentUpTo < record)
+			XLogCtl->persistentUpTo = record;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * The records up to the returned "record" have become persistent on
+		 * NVWAL.  Now signal walsenders.
+		 */
+		WalSndWakeupRequest();
+		WalSndWakeupProcessRequests();
+
+		return;
+	}
+
 	/* Quick exit if already known flushed */
 	if (record <= LogwrtResult.Flush)
 		return;
@@ -3049,6 +3284,13 @@ XLogBackgroundFlush(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/*
+	 * Quick exit if NVWAL buffer is used and archiving is not active. In this
+	 * case, we need no WAL segment file in pg_wal directory.
+	 */
+	if (NvwalAvail && !XLogArchivingActive())
+		return false;
+
 	/* read LogwrtResult and update local state */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
@@ -3067,6 +3309,18 @@ XLogBackgroundFlush(void)
 		flexible = false;		/* ensure it all gets written */
 	}
 
+	/*
+	 * If NVWAL is used, back off to the last completed segment boundary
+	 * so that buffer pages are written out to files segment by segment.  We
+	 * do so only here, after XLogCtl->asyncXactLSN is loaded, because that
+	 * value must be taken into account.
+	 */
+	if (NvwalAvail)
+	{
+		WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+		flexible = false;		/* ensure it all gets written */
+	}
+
 	/*
 	 * If already known flushed, we're done. Just need to check if we are
 	 * holding an open file handle to a logfile that's no longer in use,
@@ -3093,7 +3347,12 @@ XLogBackgroundFlush(void)
 	flushbytes =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
-	if (WalWriterFlushAfter == 0 || lastflush == 0)
+	if (NvwalAvail)
+	{
+		WriteRqst.Flush = WriteRqst.Write;
+		lastflush = now;
+	}
+	else if (WalWriterFlushAfter == 0 || lastflush == 0)
 	{
 		/* first call, or block based limits disabled */
 		WriteRqst.Flush = WriteRqst.Write;
@@ -3152,7 +3411,28 @@ XLogBackgroundFlush(void)
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
 	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+	if (NvwalAvail && max_wal_senders == 0)
+	{
+		XLogRecPtr		upto;
+
+		/*
+		 * If NVWAL is used and there is no walsender, nobody will load
+		 * segments from the buffer.  So let's recycle segments up to {where we
+		 * have requested to write and flush} + NvwalSize.
+		 *
+		 * Note that if NVWAL is used and a walsender may be running, we do
+		 * nothing; keep the written pages on the buffer so that walsenders
+		 * can load them from the buffer, not from the segment files.  The
+		 * buffer pages are eventually recycled by checkpoint.
+		 */
+		Assert(WriteRqst.Write == WriteRqst.Flush);
+		Assert(WriteRqst.Write % wal_segment_size == 0);
+
+		upto = WriteRqst.Write + NvwalSize;
+		AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+	}
+	else
+		AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
 
 	/*
 	 * If we determined that we need to write data, but somebody else
@@ -3885,6 +4165,43 @@ XLogFileClose(void)
 	ReleaseExternalFD();
 }
 
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+	XLogRecPtr	newupto,
+				InitializedUpTo;
+
+	Assert(NvwalAvail);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	newupto = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	InitializedUpTo = XLogCtl->InitializedUpTo;
+
+	newupto += NvwalSize;
+	Assert(newupto % wal_segment_size == 0);
+
+	if (newupto <= InitializedUpTo)
+		return;
+
+	/*
+	 * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+	 * handles the first argument as the beginning of pages, not the end.
+	 */
+	AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
 /*
  * Preallocate log files beyond the specified log endpoint.
  *
@@ -4181,8 +4498,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
+	 *
+	 * If NVWAL buffer is used, a log segment file is never recycled
+	 * (that is, we always go into the else block).
 	 */
-	if (wal_recycle &&
+	if (!NvwalAvail && wal_recycle &&
 		endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
@@ -4600,6 +4920,7 @@ InitControlFile(uint64 sysidentifier)
 	memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
 	ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+	ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5430,41 +5751,58 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	if (NvwalAvail)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+		pgstat_report_wait_end();
 
-	/*
-	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
-	 * close the file again in a moment.
-	 */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		nv_drain();
+		pgstat_report_wait_end();
 
-	/* Write the first page with the initial record */
-	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
-	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		/*
+		 * Other WAL state will be initialized in the startup process.
+		 */
 	}
-	pgstat_report_wait_end();
+	else
+	{
+		use_existent = false;
+		openLogFile = XLogFileInit(1, &use_existent, false);
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
-	pgstat_report_wait_end();
+		/*
+		 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+		 * close the file again in a moment.
+		 */
 
-	if (close(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not close bootstrap write-ahead log file: %m")));
+		/* Write the first page with the initial record */
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
+		pgstat_report_wait_end();
 
-	openLogFile = -1;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		if (pg_fsync(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_end();
+
+		if (close(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close bootstrap write-ahead log file: %m")));
+
+		openLogFile = -1;
+	}
 
 	/* Now create pg_control */
 	InitControlFile(sysidentifier);
@@ -5718,41 +6056,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * happens in the middle of a segment, copy data from the last WAL segment
 	 * of the old timeline up to the switch point, to the starting WAL segment
 	 * on the new timeline.
+	 *
+	 * If non-volatile WAL buffer is used, no new segment file is created. Data
+	 * up to the switch point will be copied into NVWAL buffer by StartupXLOG().
 	 */
-	if (endLogSegNo == startLogSegNo)
-	{
-		/*
-		 * Make a copy of the file on the new timeline.
-		 *
-		 * Writing WAL isn't allowed yet, so there are no locking
-		 * considerations. But we should be just as tense as XLogFileInit to
-		 * avoid emplacing a bogus file.
-		 */
-		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
-					 XLogSegmentOffset(endOfLog, wal_segment_size));
-	}
-	else
+	if (!NvwalAvail)
 	{
-		/*
-		 * The switch happened at a segment boundary, so just create the next
-		 * segment on the new timeline.
-		 */
-		bool		use_existent = true;
-		int			fd;
+		if (endLogSegNo == startLogSegNo)
+		{
+			/*
+			 * Make a copy of the file on the new timeline.
+			 *
+			 * Writing WAL isn't allowed yet, so there are no locking
+			 * considerations. But we should be just as tense as XLogFileInit to
+			 * avoid emplacing a bogus file.
+			 */
+			XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+						 XLogSegmentOffset(endOfLog, wal_segment_size));
+		}
+		else
+		{
+			/*
+			 * The switch happened at a segment boundary, so just create the next
+			 * segment on the new timeline.
+			 */
+			bool		use_existent = true;
+			int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+			fd = XLogFileInit(startLogSegNo, &use_existent, true);
 
-		if (close(fd) != 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno = errno;
+			if (close(fd) != 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno = errno;
 
-			XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
-						 wal_segment_size);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not close file \"%s\": %m", xlogfname)));
+				XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+							 wal_segment_size);
+				errno = save_errno;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", xlogfname)));
+			}
 		}
 	}
 
@@ -7009,6 +7353,11 @@ StartupXLOG(void)
 		InRecovery = true;
 	}
 
+	/* Dump discardedUpTo just before REDO */
+	elog(LOG, "ControlFile->discardedUpTo %X/%X",
+		 (uint32) (ControlFile->discardedUpTo >> 32),
+		 (uint32) ControlFile->discardedUpTo);
+
 	/* REDO */
 	if (InRecovery)
 	{
@@ -7795,10 +8144,88 @@ StartupXLOG(void)
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	discardedUpTo;
+
+		discardedUpTo = ControlFile->discardedUpTo;
+		Assert(discardedUpTo == InvalidXLogRecPtr ||
+			   discardedUpTo % wal_segment_size == 0);
+
+		if (discardedUpTo == InvalidXLogRecPtr)
+		{
+			elog(DEBUG1, "brand-new NVWAL");
+
+			/* The following "Tricky point" is to initialize the buffer */
+		}
+		else if (EndOfLog <= discardedUpTo)
+		{
+			elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = InvalidXLogRecPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+
+			nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+			/* The following "Tricky point" is to initialize the buffer */
+		}
+		else
+		{
+			int			last_idx;
+			int			idx;
+			XLogRecPtr	ptr;
+
+			elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+			/*
+			 * Initialize the xlblocks array because we decided to keep UNDONE
+			 * records on the NVWAL buffer; otherwise, each page on the buffer
+			 * with xlblocks == 0 (as initialized by XLOGShmemInit) would be
+			 * accidentally cleared by the following AdvanceXLInsertBuffer!
+			 *
+			 * Two cases can be considered:
+			 *
+			 * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+			 *    Initialize up to (and including) the page containing the last
+			 *    record.  That page should end with EndOfLog.  The next page
+			 *    "N", beginning with EndOfLog, is to be left untouched
+			 *    because, in the corner case where all the NVWAL buffer
+			 *    pages are already filled, page N is at the same location
+			 *    as the first page "F" beginning with discardedUpTo.  Of
+			 *    course we should not overwrite page F.
+			 *
+			 *    In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+			 *    last_idx, indicating page N.  Then we go forward from
+			 *    page F up to (but excluding) page N, which in that corner
+			 *    case has the same index as page F.
+			 *
+			 * 2) EndOfLog is not on a page boundary:  Initialize all the pages
+			 *    but the page "L" having the last record. The page L is to be
+			 *    initialized by the following "Tricky point", including its
+			 *    content.
+			 *
+			 * In either case, XLogCtl->InitializedUpTo is to be initialized in
+			 * the following "Tricky" if-else block.
+			 */
+
+			last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+			ptr = discardedUpTo;
+			for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+				 idx = NextBufIdx(idx))
+			{
+				ptr += XLOG_BLCKSZ;
+				XLogCtl->xlblocks[idx] = ptr;
+			}
+		}
+	}
+
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * Tricky point here: readBuf contains the *last* block that the
+	 * LastRec record spans, not the one it starts in.  The last block is
+	 * indeed the one we want to use.
 	 */
 	if (EndOfLog % XLOG_BLCKSZ != 0)
 	{
@@ -7818,6 +8245,9 @@ StartupXLOG(void)
 		memcpy(page, xlogreader->readBuf, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
+		if (NvwalAvail)
+			nv_persist(page, XLOG_BLCKSZ);
+
 		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
 		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
 	}
@@ -7831,12 +8261,54 @@ StartupXLOG(void)
 		XLogCtl->InitializedUpTo = EndOfLog;
 	}
 
-	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+	if (NvwalAvail)
+	{
+		XLogRecPtr	SegBeginPtr;
+
+		/*
+		 * If NVWAL buffer is used, records should be written out to segment
+		 * files segment by segment.  So Logwrt{Rqst,Result} (and also
+		 * discardedUpTo) should be multiples of wal_segment_size.  Let's
+		 * back them off to the last segment boundary.
+		 */
+
+		SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+		LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+		XLogCtl->LogwrtResult = LogwrtResult;
+		XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+		XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+		/*
+		 * persistentUpTo does not need to be a multiple of wal_segment_size;
+		 * it should be the LSN up to which records have been drained.
+		 * walsender will use it to load records from the NVWAL buffer.
+		 */
+		XLogCtl->persistentUpTo = EndOfLog;
+
+		/* Update discardedUpTo in pg_control if still invalid */
+		if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+		{
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = SegBeginPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+		}
+
+		elog(DEBUG1, "EndOfLog: %X/%X",
+			 (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
 
-	XLogCtl->LogwrtResult = LogwrtResult;
+		elog(DEBUG1, "SegBeginPtr: %X/%X",
+			 (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+	}
+	else
+	{
+		LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
-	XLogCtl->LogwrtRqst.Write = EndOfLog;
-	XLogCtl->LogwrtRqst.Flush = EndOfLog;
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		XLogCtl->LogwrtRqst.Write = EndOfLog;
+		XLogCtl->LogwrtRqst.Flush = EndOfLog;
+	}
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7967,6 +8439,7 @@ StartupXLOG(void)
 				char		origpath[MAXPGPATH];
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
+				XLogRecPtr	discardedUpTo;
 
 				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7978,6 +8451,53 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
+				/*
+				 * If NVWAL is also used for archival recovery, write old
+				 * records out to segment files to archive them.  Note that we
+				 * need locks related to WAL because LocalXLogInsertAllowed
+				 * has already been set to -1.
+				 */
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo < EndOfLog)
+				{
+					XLogwrtRqst WriteRqst;
+					TimeLineID	thisTLI = ThisTimeLineID;
+					XLogRecPtr	SegBeginPtr =
+						EndOfLog - (EndOfLog % wal_segment_size);
+
+					/*
+					 * XXX Assume that all the records have the same TLI.
+					 */
+					ThisTimeLineID = EndOfLogTLI;
+
+					WriteRqst.Write = EndOfLog;
+					WriteRqst.Flush = 0;
+
+					LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+					XLogWrite(WriteRqst, false);
+
+					/*
+					 * Force back-off to the last segment boundary.
+					 */
+					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+					ControlFile->discardedUpTo = SegBeginPtr;
+					UpdateControlFile();
+					LWLockRelease(ControlFileLock);
+
+					LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+					SpinLockAcquire(&XLogCtl->info_lck);
+					XLogCtl->LogwrtResult = LogwrtResult;
+					XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+					XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+					SpinLockRelease(&XLogCtl->info_lck);
+
+					LWLockRelease(WALWriteLock);
+
+					ThisTimeLineID = thisTLI;
+				}
+
 				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
@@ -7987,7 +8507,10 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	if (NvwalAvail)
+		PreallocNonVolatileXlogBuffer();
+	else
+		PreallocXlogFiles(EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8550,10 +9073,24 @@ GetInsertRecPtr(void)
 /*
  * GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
  * position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
  */
 XLogRecPtr
 GetFlushRecPtr(void)
 {
+	if (NvwalAvail)
+	{
+		XLogRecPtr		ret;
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		ret = XLogCtl->persistentUpTo;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		return ret;
+	}
+
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	SpinLockRelease(&XLogCtl->info_lck);
@@ -8853,6 +9390,9 @@ CreateCheckPoint(int flags)
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
+	/* for non-volatile WAL buffer */
+	XLogRecPtr	newDiscardedUpTo = 0;
+
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 	 * issued at a different time.
@@ -9164,6 +9704,22 @@ CreateCheckPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	/*
+	 * If non-volatile WAL buffer is used, discardedUpTo should be updated and
+	 * persisted in the control file.  So the new value should be calculated
+	 * here.
+	 *
+	 * TODO Do not copy and paste code...
+	 */
+	if (NvwalAvail)
+	{
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+
+		newDiscardedUpTo = _logSegNo * wal_segment_size;
+	}
+
 	/*
 	 * Update the control file.
 	 */
@@ -9172,6 +9728,16 @@ CreateCheckPoint(int flags)
 		ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
+	if (NvwalAvail)
+	{
+		/*
+		 * A new value should not fall behind the old one.
+		 */
+		if (ControlFile->discardedUpTo < newDiscardedUpTo)
+			ControlFile->discardedUpTo = newDiscardedUpTo;
+		else
+			newDiscardedUpTo = ControlFile->discardedUpTo;
+	}
 	ControlFile->time = (pg_time_t) time(NULL);
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9189,6 +9755,44 @@ CreateCheckPoint(int flags)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+	 * so that the XLOG records older than newDiscardedUpTo are treated as
+	 * "already written and flushed."
+	 */
+	if (NvwalAvail)
+	{
+		Assert(newDiscardedUpTo > 0);
+
+		/* Update process-local variables */
+		LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+		/*
+		 * Update shared-memory variables. We need both light-weight lock and
+		 * spin lock to update them.
+		 */
+		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&XLogCtl->info_lck);
+
+		/*
+		 * Note that there can be a corner case where the process-local
+		 * LogwrtResult falls behind shared XLogCtl->LogwrtResult if the whole
+		 * non-volatile XLOG buffer is filled and some pages are written out
+		 * to segment files between UpdateControlFile and LWLockAcquire above.
+		 *
+		 * TODO For now, we ignore that case because it can hardly occur.
+		 */
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+		if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteLock);
+	}
+
 	/* Update shared-memory copy of checkpoint XID/epoch */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->ckptFullXid = checkPoint.nextFullXid;
@@ -9212,22 +9816,48 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
-	/*
-	 * Delete old log files, those no longer needed for last checkpoint to
-	 * prevent the disk holding the xlog from growing full.
-	 */
-	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-	KeepLogSeg(recptr, &_logSegNo);
-	InvalidateObsoleteReplicationSlots(_logSegNo);
-	_logSegNo--;
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	if (NvwalAvail)
+	{
+		/*
+		 * We already set _logSegNo to the value equivalent to discardedUpTo.
+		 * We first increment it to call InvalidateObsoleteReplicationSlots.
+		 */
+		_logSegNo++;
+		InvalidateObsoleteReplicationSlots(_logSegNo);
+
+		/*
+		 * Then we decrement _logSegNo again to remove WAL segment files
+		 * having spilled out of non-volatile WAL buffer.
+		 *
+		 * Note that you should set wal_recycle to off to remove segment files.
+		 */
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
+	else
+	{
+		/*
+		 * Delete old log files, those no longer needed for last checkpoint to
+		 * prevent the disk holding the xlog from growing full.
+		 */
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		InvalidateObsoleteReplicationSlots(_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+	{
+		if (NvwalAvail)
+			PreallocNonVolatileXlogBuffer();
+		else
+			PreallocXlogFiles(recptr);
+	}
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -11971,6 +12601,170 @@ CancelBackup(void)
 	}
 }
 
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+	return NvwalAvail;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets nvwalptr to load-from LSN.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+	XLogRecPtr	readUpTo;
+	XLogRecPtr	discardedUpTo;
+
+	Assert(IsNvwalAvail());
+	Assert(nvwalptr != NULL);
+
+	readUpTo = target + count;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	discardedUpTo = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	/* Check if all the records are on WAL segment files */
+	if (readUpTo <= discardedUpTo)
+		return 0;
+
+	/* Check if all the records are on NVWAL */
+	if (discardedUpTo <= target)
+	{
+		*nvwalptr = target;
+		return count;
+	}
+
+	/* Some on WAL segment files, some on NVWAL */
+	*nvwalptr = discardedUpTo;
+	return (Size) (readUpTo - discardedUpTo);
+}
+
+/*
+ * It is like WALRead @ xlogreader.c, but loads from non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	Assert(NvwalAvail);
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	/*
+	 * Hold WALBufMappingLock in shared mode to keep others from rotating the
+	 * WAL buffer while we copy WAL records from it.  We do not need an
+	 * exclusive lock because we will not rotate the buffer in this function.
+	 */
+	LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+	while (nbytes > 0)
+	{
+		char	   *q;
+		Size		off;
+		Size		max_copy;
+		Size		copybytes;
+		XLogRecPtr	discardedUpTo;
+
+		LWLockAcquire(ControlFileLock, LW_SHARED);
+		discardedUpTo = ControlFile->discardedUpTo;
+		LWLockRelease(ControlFileLock);
+
+		/* Check if the records we need have been already evicted or not */
+		if (recptr < discardedUpTo)
+		{
+			LWLockRelease(WALBufMappingLock);
+
+			/* TODO error handling? */
+			return false;
+		}
+
+		/*
+		 * Get the target address on the non-volatile WAL buffer and the size
+		 * we can copy from it at once.  Because the buffer can wrap around,
+		 * we might have to split the copy into two or more parts.
+		 */
+		off = recptr % NvwalSize;
+		q = XLogCtl->pages + off;
+		max_copy = NvwalSize - off;
+		copybytes = Min(nbytes, max_copy);
+
+		memcpy(p, q, copybytes);
+
+		/* Update state for copy */
+		recptr += copybytes;
+		nbytes -= copybytes;
+		p += copybytes;
+	}
+
+	LWLockRelease(WALBufMappingLock);
+	return true;
+}
+
+static bool
+IsXLogSourceFromStream(XLogSource source)
+{
+	switch (source)
+	{
+		case XLOG_FROM_STREAM:
+		case XLOG_FROM_STREAM_NVWAL:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
+static bool
+IsXLogSourceFromNvwal(XLogSource source)
+{
+	switch (source)
+	{
+		case XLOG_FROM_NVWAL:
+		case XLOG_FROM_STREAM_NVWAL:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
+static bool
+NeedsForMoreXLog(XLogRecPtr targetChunkEndPtr)
+{
+	switch (readSource)
+	{
+		case XLOG_FROM_ARCHIVE:
+		case XLOG_FROM_PG_WAL:
+			return (readFile < 0);
+
+		case XLOG_FROM_NVWAL:
+			Assert(NvwalAvail);
+			return false;
+
+		case XLOG_FROM_STREAM:
+			return (flushedUpto < targetChunkEndPtr);
+
+		case XLOG_FROM_STREAM_NVWAL:
+			Assert(NvwalAvail);
+			return (flushedUpto < targetChunkEndPtr);
+
+		default: /* XLOG_FROM_ANY */
+			Assert(readFile < 0);
+			return true;
+	}
+}
+
 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
  * Returns number of bytes read, if the page is read successfully, or -1
@@ -12012,7 +12806,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || IsXLogSourceFromNvwal(readSource)) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -12029,7 +12823,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		if (readFile >= 0)
+			close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
 	}
@@ -12038,9 +12833,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
-		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+	if (NeedsForMoreXLog(targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -12061,7 +12854,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || IsXLogSourceFromNvwal(readSource));
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -12069,7 +12862,7 @@ retry:
 	 * requested record has been received, but this is for the benefit of
 	 * future calls, to allow quick exit at the top of this function.
 	 */
-	if (readSource == XLOG_FROM_STREAM)
+	if (IsXLogSourceFromStream(readSource))
 	{
 		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
@@ -12080,41 +12873,59 @@ retry:
 	else
 		readLen = XLOG_BLCKSZ;
 
-	/* Read the requested page */
 	readOff = targetPageOff;
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (IsXLogSourceFromNvwal(readSource))
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		Size		offset = (Size) (targetPagePtr % NvwalSize);
+		char	   *readpos = XLogCtl->pages + offset;
+
+		Assert(offset % XLOG_BLCKSZ == 0);
 
+		/* Load the requested page from non-volatile WAL buffer */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		memcpy(readBuf, readpos, readLen);
 		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+
+		/* There is no other clue to the TLI... */
+		xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+	}
+	else
+	{
+		/* Read the requested page from file */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
+		pgstat_report_wait_end();
+
+		xlogreader->seg.ws_tli = curFileTLI;
 	}
-	pgstat_report_wait_end();
 
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->seg.ws_tli = curFileTLI;
-
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
 	 * it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -12148,6 +12959,17 @@ retry:
 		goto next_record_is_invalid;
 	}
 
+	/*
+	 * Update curFileTLI on each verified page if the non-volatile WAL buffer
+	 * is used, because NVWAL's filename carries no TimeLineID information.
+	 */
+	if (IsXLogSourceFromNvwal(readSource) &&
+		curFileTLI != xlogreader->latestPageTLI)
+	{
+		curFileTLI = xlogreader->latestPageTLI;
+		elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+	}
+
 	return readLen;
 
 next_record_is_invalid:
@@ -12229,7 +13051,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	if (!InArchiveRecovery)
 		currentSource = XLOG_FROM_PG_WAL;
 	else if (currentSource == XLOG_FROM_ANY ||
-			 (!StandbyMode && currentSource == XLOG_FROM_STREAM))
+			 (!StandbyMode && IsXLogSourceFromStream(currentSource)))
 	{
 		lastSourceFailed = false;
 		currentSource = XLOG_FROM_ARCHIVE;
@@ -12252,6 +13074,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			{
 				case XLOG_FROM_ARCHIVE:
 				case XLOG_FROM_PG_WAL:
+				case XLOG_FROM_NVWAL:
 
 					/*
 					 * Check to see if the trigger file exists. Note that we
@@ -12265,6 +13088,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						return false;
 					}
 
+					/* Try NVWAL if available */
+					if (NvwalAvail && currentSource != XLOG_FROM_NVWAL)
+					{
+						currentSource = XLOG_FROM_NVWAL;
+						break;
+					}
+
 					/*
 					 * Not in standby mode, and we've now tried the archive
 					 * and pg_wal.
@@ -12276,11 +13106,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Move to XLOG_FROM_STREAM state, and set to start a
 					 * walreceiver if necessary.
 					 */
-					currentSource = XLOG_FROM_STREAM;
+					if (currentSource == XLOG_FROM_NVWAL)
+						currentSource = XLOG_FROM_STREAM_NVWAL;
+					else
+						currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
 					break;
 
 				case XLOG_FROM_STREAM:
+				case XLOG_FROM_STREAM_NVWAL:
 
 					/*
 					 * Failure while streaming. Most likely, we got here
@@ -12386,6 +13220,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
+			case XLOG_FROM_NVWAL:
 
 				/*
 				 * WAL receiver must not be running when reading WAL from
@@ -12403,6 +13238,59 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (randAccess)
 					curFileTLI = 0;
 
+				/* Try to load from NVWAL */
+				if (currentSource == XLOG_FROM_NVWAL)
+				{
+					XLogRecPtr		discardedUpTo;
+
+					Assert(NvwalAvail);
+
+					/*
+					 * Check if the target page exists on NVWAL.  Note that
+					 * RecPtr points to the end of the target chunk.
+					 *
+					 * TODO need ControlFileLock?
+					 */
+					discardedUpTo = ControlFile->discardedUpTo;
+					if (discardedUpTo != InvalidXLogRecPtr &&
+						discardedUpTo < RecPtr &&
+						RecPtr <= discardedUpTo + NvwalSize)
+					{
+						/* Report recovery progress in PS display */
+						set_ps_display("recovering NVWAL");
+						elog(DEBUG1, "recovering NVWAL");
+
+						/* Track source of data and receipt time */
+						readSource = XLOG_FROM_NVWAL;
+						XLogReceiptSource = XLOG_FROM_NVWAL;
+						XLogReceiptTime = GetCurrentTimestamp();
+
+						/*
+						 * Construct expectedTLEs.  This is necessary to
+						 * recover only from NVWAL because its filename does
+						 * not have any TLI information.
+						 */
+						if (!expectedTLEs)
+						{
+							TimeLineHistoryEntry	   *entry;
+
+							entry = palloc(sizeof(TimeLineHistoryEntry));
+							entry->tli = recoveryTargetTLI;
+							entry->begin = entry->end = InvalidXLogRecPtr;
+
+							expectedTLEs = list_make1(entry);
+							elog(DEBUG1, "expectedTLEs: [%u]",
+								 (uint32) recoveryTargetTLI);
+						}
+
+						return true;
+					}
+
+					/* Target page does not exist on NVWAL */
+					lastSourceFailed = true;
+					break;
+				}
+
 				/*
 				 * Try to restore the file from archive, or read an existing
 				 * file from pg_wal.
@@ -12420,6 +13308,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				break;
 
 			case XLOG_FROM_STREAM:
+			case XLOG_FROM_STREAM_NVWAL:
 				{
 					bool		havedata;
 
@@ -12544,21 +13433,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (currentSource == XLOG_FROM_STREAM_NVWAL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
-							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+
+							/* TODO is it ok to return, not to break switch? */
+							readSource = XLOG_FROM_STREAM_NVWAL;
+							XLogReceiptSource = XLOG_FROM_STREAM_NVWAL;
+							return true;
 						}
 						else
 						{
-							/* just make sure source info is correct... */
-							readSource = XLOG_FROM_STREAM;
-							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							if (readFile < 0)
+							{
+								if (!expectedTLEs)
+									expectedTLEs = readTimeLineHistory(receiveTLI);
+								readFile = XLogFileRead(readSegNo, PANIC,
+														receiveTLI,
+														XLOG_FROM_STREAM, false);
+								Assert(readFile >= 0);
+							}
+							else
+							{
+								/* just make sure source info is correct... */
+								readSource = XLOG_FROM_STREAM;
+								XLogReceiptSource = XLOG_FROM_STREAM;
+								return true;
+							}
 						}
 						break;
 					}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cb76be4f46..77f629fda2 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1066,11 +1066,24 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	XLogRecPtr	recptr_nvwal = 0;
+	Size		nbytes_nvwal = 0;
+#endif
 
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
 
+#ifndef FRONTEND
+	/* Try to load records directly from NVWAL if used */
+	if (IsNvwalAvail())
+	{
+		nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+		nbytes = count - nbytes_nvwal;
+	}
+#endif
+
 	while (nbytes > 0)
 	{
 		uint32		startoff;
@@ -1138,6 +1151,17 @@ WALRead(XLogReaderState *state,
 		p += readbytes;
 	}
 
+#ifndef FRONTEND
+	if (IsNvwalAvail())
+	{
+		if (!CopyXLogRecordsFromNVWAL(p, nbytes_nvwal, recptr_nvwal))
+		{
+			/* TODO graceful error handling */
+			elog(PANIC, "some records on NVWAL had been discarded");
+		}
+	}
+#endif
+
 	return true;
 }
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index e73639df74..4c594e915f 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("discarded Up To:                      %X/%X\n"),
+		   (uint32) (ControlFile->discardedUpTo >> 32),
+		   (uint32) ControlFile->discardedUpTo);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 0a05e79524..75433a6dc0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -351,6 +351,14 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern bool IsNvwalAvail(void);
+extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
+										   Size count,
+										   XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+									 Size count,
+									 XLogRecPtr startptr);
+
 /*
  * Routines to start, stop, and get status of a base backup.
  */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index de5670e538..fe71992a69 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1301
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
 
 	XLogRecPtr	unloggedLSN;	/* current fake LSN value, for unlogged rels */
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+	 * checkpoint or a restartpoint completes successfully, or when the whole
+	 * NVWAL is filled with WAL records and a new record is being inserted.
+	 * This field tells that the NVWAL contains WAL records in the range of
+	 * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+	 * Note that WAL records whose LSNs are less than discardedUpTo remain
+	 * in WAL segment files and may still be needed for recovery.
+	 *
+	 * It is set to zero when NVWAL is not used.
+	 */
+	XLogRecPtr	discardedUpTo;
+
 	/*
 	 * These two values determine the minimum point we must recover up to
 	 * before starting up:
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..bbd47e1663 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -142,14 +142,17 @@ HINT:  No function matches the given name and argument types. You might need to
 select setting as segsize
 from pg_settings where name = 'wal_segment_size'
 \gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
  ok 
 ----
  t
 (1 row)
 
 -- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
  ok 
 ----
  t
@@ -161,14 +164,15 @@ select * from pg_ls_waldir() limit 0;
 ------+------+--------------
 (0 rows)
 
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
  ok 
 ----
  t
 (1 row)
 
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+  (select * from pg_ls_waldir() w
+   where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
  ok 
 ----
  t
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 094e8f8296..09c326775d 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -39,15 +39,19 @@ SELECT num_nulls();
 select setting as segsize
 from pg_settings where name = 'wal_segment_size'
 \gset
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
 
-select count(*) > 0 as ok from pg_ls_waldir();
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
 -- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
 -- Test not-run-to-completion cases.
 select * from pg_ls_waldir() limit 0;
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+  (select * from pg_ls_waldir() w
+   where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
 
 select count(*) >= 0 as ok from pg_ls_archive_statusdir();
 
-- 
2.17.1
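
The discardedUpTo comment in pg_control.h and the range test added to WaitForWALToBecomeAvailable() both treat the NVWAL as a window of S bytes starting at discardedUpTo. A minimal sketch of that check follows, assuming a plain modulo mapping from LSN to ring offset (the patch uses such a mapping for segment starts in pg_basebackup; whether the backend uses exactly this formula is not shown here). The names nvwal_window_contains and nvwal_offset are hypothetical, and XLogRecPtr is a local stand-in typedef rather than the PostgreSQL header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* stand-in for PostgreSQL's XLogRecPtr */
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Does the chunk ending at end_ptr lie inside the NVWAL window
 * [discarded_up_to, discarded_up_to + nvwal_size)?  This mirrors the
 * condition in the patch, where RecPtr points to the end of the chunk.
 * (Hypothetical helper, not part of the patch.)
 */
static bool
nvwal_window_contains(XLogRecPtr discarded_up_to, size_t nvwal_size,
					  XLogRecPtr end_ptr)
{
	if (discarded_up_to == InvalidXLogRecPtr)
		return false;			/* zero means NVWAL is not used */

	return discarded_up_to < end_ptr &&
		end_ptr <= discarded_up_to + nvwal_size;
}

/*
 * Map an LSN to a byte offset in the ring.
 * Assumption: plain modulo by the NVWAL size.
 */
static size_t
nvwal_offset(XLogRecPtr lsn, size_t nvwal_size)
{
	assert(nvwal_size > 0);
	return (size_t) (lsn % nvwal_size);
}
```

Because the end pointer is compared with "<" on the low side and "<=" on the high side, a chunk that ends exactly at discardedUpTo + S is still served from NVWAL, while one that ends at discardedUpTo is not.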

v3-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (application/octet-stream)
From e3a4da834a79770c63c26c9859dc179911a37540 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:58 +0900
Subject: [PATCH v3 3/5] walreceiver supports non-volatile WAL buffer

walreceiver now stores received records directly in the non-volatile
WAL buffer if applicable.
---
 src/backend/access/transam/xlog.c     | 31 +++++++++++++++-
 src/backend/replication/walreceiver.c | 53 ++++++++++++++++++++++++++-
 src/include/access/xlog.h             |  4 ++
 3 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 45e05b9498..2a022be36a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -925,6 +925,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+static bool CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr,
+								   bool store);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -12650,6 +12652,21 @@ GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
  */
 bool
 CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	return CopyXLogRecordsOnNVWAL(buf, count, startptr, false);
+}
+
+/*
+ * Called by walreceiver.
+ */
+bool
+CopyXLogRecordsToNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	return CopyXLogRecordsOnNVWAL(buf, count, startptr, true);
+}
+
+static bool
+CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr, bool store)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -12699,7 +12716,13 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
 		max_copy = NvwalSize - off;
 		copybytes = Min(nbytes, max_copy);
 
-		memcpy(p, q, copybytes);
+		if (store)
+		{
+			memcpy(q, p, copybytes);
+			nv_flush(q, copybytes);
+		}
+		else
+			memcpy(p, q, copybytes);
 
 		/* Update state for copy */
 		recptr += copybytes;
@@ -12711,6 +12734,12 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
 	return true;
 }
 
+void
+SyncNVWAL(void)
+{
+	nv_drain();
+}
+
 static bool
 IsXLogSourceFromStream(XLogSource source)
 {
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d1ad75da87..20922ed230 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -130,6 +130,7 @@ static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *start
 static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+static void XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
@@ -856,7 +857,10 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				buf += hdrlen;
 				len -= hdrlen;
-				XLogWalRcvWrite(buf, len, dataStart);
+				if (IsNvwalAvail())
+					XLogWalRcvStore(buf, len, dataStart);
+				else
+					XLogWalRcvWrite(buf, len, dataStart);
 				break;
 			}
 		case 'k':				/* Keepalive */
@@ -991,6 +995,42 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
+/*
+ * Like XLogWalRcvWrite, but store to non-volatile WAL buffer.
+ */
+static void
+XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr)
+{
+	Assert(IsNvwalAvail());
+
+	CopyXLogRecordsToNVWAL(buf, nbytes, recptr);
+
+	/*
+	 * Also write out to file if we have to archive segments.
+	 *
+	 * We could do this segment by segment, but we reuse the existing method
+	 * and do it record by record because the former would add complexity
+	 * (locking WalBufMappingLock, getting the address of the segment on
+	 * the non-volatile WAL buffer, etc.).
+	 */
+	if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		XLogWalRcvWrite(buf, nbytes, recptr);
+	else
+	{
+		/*
+		 * Update status as XLogWalRcvWrite does.
+		 */
+
+		/* Update process-local status */
+		XLByteToSeg(recptr + nbytes, recvSegNo, wal_segment_size);
+		recvFileTLI = ThisTimeLineID;
+		LogstreamResult.Write = recptr + nbytes;
+
+		/* Update shared-memory status */
+		pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	}
+}
+
 /*
  * Flush the log to disk.
  *
@@ -1004,7 +1044,16 @@ XLogWalRcvFlush(bool dying)
 	{
 		WalRcvData *walrcv = WalRcv;
 
-		issue_xlog_fsync(recvFile, recvSegNo);
+		/*
+		 * We should call both SyncNVWAL and issue_xlog_fsync if we use NVWAL
+		 * and WAL archive.  So we have the following two if-statements, not
+		 * one if-else-statement.
+		 */
+		if (IsNvwalAvail())
+			SyncNVWAL();
+
+		if (recvFile >= 0)
+			issue_xlog_fsync(recvFile, recvSegNo);
 
 		LogstreamResult.Flush = LogstreamResult.Write;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75433a6dc0..e6ca151271 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -358,6 +358,10 @@ extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
 extern bool CopyXLogRecordsFromNVWAL(char *buf,
 									 Size count,
 									 XLogRecPtr startptr);
+extern bool CopyXLogRecordsToNVWAL(char *buf,
+								   Size count,
+								   XLogRecPtr startptr);
+extern void SyncNVWAL(void);
 
 /*
  * Routines to start, stop, and get status of a base backup.
-- 
2.17.1
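
The patch above turns CopyXLogRecordsFromNVWAL() into a direction-neutral helper that copies in chunks, each limited to the bytes remaining before the end of the buffer. The sketch below shows that wrap-around loop on ordinary memory. It assumes the ring offset is the position modulo the ring size, and it omits the PMEM flush that the patch issues via nv_flush() after each store. The name ring_copy is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy count bytes between buf and a ring of ring_size bytes, starting at
 * logical position startpos.  If store is true, buf is written into the
 * ring; otherwise the ring is read into buf.  (Hypothetical helper; a real
 * PMEM store would also flush each chunk.)
 */
static void
ring_copy(char *ring, size_t ring_size, char *buf, size_t count,
		  uint64_t startpos, bool store)
{
	char	   *p = buf;
	uint64_t	pos = startpos;
	size_t		nbytes = count;

	assert(ring_size > 0 && count <= ring_size);

	while (nbytes > 0)
	{
		/* Assumption: offset in the ring is the position modulo its size */
		size_t		off = (size_t) (pos % ring_size);
		size_t		max_copy = ring_size - off;	/* bytes until the ring's end */
		size_t		copybytes = nbytes < max_copy ? nbytes : max_copy;

		if (store)
			memcpy(ring + off, p, copybytes);	/* a PMEM flush would follow */
		else
			memcpy(p, ring + off, copybytes);

		/* Update state for the next chunk */
		pos += copybytes;
		p += copybytes;
		nbytes -= copybytes;
	}
}
```

Storing five bytes at position 6 of an 8-byte ring, for instance, takes two chunks: two bytes at offsets 6..7 and three bytes at offsets 0..2. Reading the same range back reassembles them in order.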

v3-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (application/octet-stream)
From c9736171b0480c57ce8f457a3ce1a8ee29ce02f6 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:59 +0900
Subject: [PATCH v3 4/5] pg_basebackup supports non-volatile WAL buffer

pg_basebackup now copies received WAL segments onto the non-volatile
WAL buffer when run in "nvwal" mode (-Fn).

Specify the path of the new NVWAL file with the --nvwal-path option.
The path is written to postgresql.auto.conf or recovery.conf.
The new NVWAL has the same size as the master's.
---
 src/bin/pg_basebackup/pg_basebackup.c | 335 +++++++++++++++++++++++++-
 src/bin/pg_basebackup/streamutil.c    |  69 ++++++
 src/bin/pg_basebackup/streamutil.h    |   3 +
 src/bin/pg_rewind/pg_rewind.c         |   4 +-
 src/fe_utils/recovery_gen.c           |   9 +-
 src/include/fe_utils/recovery_gen.h   |   3 +-
 6 files changed, 407 insertions(+), 16 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 4f29671d0c..e56fae7f47 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,9 @@
 #ifdef HAVE_LIBZ
 #include <zlib.h>
 #endif
+#ifdef USE_NVWAL
+#include <libpmem.h>
+#endif
 
 #include "access/xlog_internal.h"
 #include "common/file_perm.h"
@@ -127,7 +130,8 @@ typedef enum
 static char *basedir = NULL;
 static TablespaceList tablespace_dirs = {NULL, NULL};
 static char *xlog_dir = NULL;
-static char format = 'p';		/* p(lain)/t(ar) */
+static char format = 'p';			/* p(lain)/t(ar); 'p' even if 'nvwal' given */
+static bool format_nvwal = false;	/* true if 'nvwal' given */
 static char *label = "pg_basebackup base backup";
 static bool noclean = false;
 static bool checksum_failure = false;
@@ -150,14 +154,24 @@ static bool verify_checksums = true;
 static bool manifest = true;
 static bool manifest_force_encode = false;
 static char *manifest_checksums = NULL;
+static char *nvwal_path = NULL;
+#ifdef USE_NVWAL
+static size_t nvwal_size = 0;
+static char *nvwal_pages = NULL;
+static size_t nvwal_mapped_len = 0;
+#endif
 
 static bool success = false;
+static bool xlogdir_is_pg_xlog = false;
 static bool made_new_pgdata = false;
 static bool found_existing_pgdata = false;
 static bool made_new_xlogdir = false;
 static bool found_existing_xlogdir = false;
 static bool made_tablespace_dirs = false;
 static bool found_tablespace_dirs = false;
+#ifdef USE_NVWAL
+static bool made_new_nvwal = false;
+#endif
 
 /* Progress counters */
 static uint64 totalsize_kb;
@@ -381,7 +395,7 @@ usage(void)
 	printf(_("  %s [OPTION]...\n"), progname);
 	printf(_("\nOptions controlling the output:\n"));
 	printf(_("  -D, --pgdata=DIRECTORY receive base backup into directory\n"));
-	printf(_("  -F, --format=p|t       output format (plain (default), tar)\n"));
+	printf(_("  -F, --format=p|t|n     output format (plain (default), tar, nvwal)\n"));
 	printf(_("  -r, --max-rate=RATE    maximum transfer rate to transfer data directory\n"
 			 "                         (in kB/s, or use suffix \"k\" or \"M\")\n"));
 	printf(_("  -R, --write-recovery-conf\n"
@@ -389,6 +403,7 @@ usage(void)
 	printf(_("  -T, --tablespace-mapping=OLDDIR=NEWDIR\n"
 			 "                         relocate tablespace in OLDDIR to NEWDIR\n"));
 	printf(_("      --waldir=WALDIR    location for the write-ahead log directory\n"));
+	printf(_("      --nvwal-path=NVWAL location for the NVWAL file\n"));
 	printf(_("  -X, --wal-method=none|fetch|stream\n"
 			 "                         include required WAL files with specified method\n"));
 	printf(_("  -z, --gzip             compress tar output\n"));
@@ -629,9 +644,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 
 	/* In post-10 cluster, pg_xlog has been renamed to pg_wal */
 	snprintf(param->xlog, sizeof(param->xlog), "%s/%s",
-			 basedir,
-			 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-			 "pg_xlog" : "pg_wal");
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 	/* Temporary replication slots are only supported in 10 and newer */
 	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_TEMP_SLOTS)
@@ -668,9 +681,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 		 * tar file may arrive later.
 		 */
 		snprintf(statusdir, sizeof(statusdir), "%s/%s/archive_status",
-				 basedir,
-				 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-				 "pg_xlog" : "pg_wal");
+				 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 		if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
 		{
@@ -1787,6 +1798,135 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
 	appendPQExpBuffer(buf, copybuf, r);
 }
 
+#ifdef USE_NVWAL
+static void
+cleanup_nvwal_atexit(void)
+{
+	if (success || in_log_streamer)
+		return;
+
+	if (nvwal_pages != NULL)
+	{
+		pg_log_info("unmapping NVWAL");
+		if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+		{
+			pg_log_error("could not unmap NVWAL: %m");
+			return;
+		}
+	}
+
+	if (nvwal_path != NULL && made_new_nvwal)
+	{
+		pg_log_info("removing NVWAL file \"%s\"", nvwal_path);
+		if (unlink(nvwal_path) < 0)
+		{
+			pg_log_error("could not remove NVWAL file \"%s\": %m", nvwal_path);
+			return;
+		}
+	}
+}
+
+static int
+filter_walseg(const struct dirent *d)
+{
+	char			fullpath[MAXPGPATH];
+	struct stat		statbuf;
+
+	if (!IsXLogFileName(d->d_name))
+		return 0;
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", d->d_name);
+
+	if (stat(fullpath, &statbuf) < 0)
+		return 0;
+
+	if (!S_ISREG(statbuf.st_mode))
+		return 0;
+
+	if (statbuf.st_size != WalSegSz)
+		return 0;
+
+	return 1;
+}
+
+static int
+compare_walseg(const struct dirent **a, const struct dirent **b)
+{
+	return strcmp((*a)->d_name, (*b)->d_name);
+}
+
+static void
+free_namelist(struct dirent **namelist, int nr)
+{
+	for (int i = 0; i < nr; ++i)
+		free(namelist[i]);
+
+	free(namelist);
+}
+
+static bool
+copy_walseg_onto_nvwal(const char *segname)
+{
+	char			fullpath[MAXPGPATH];
+	int				fd;
+	size_t			off;
+	struct stat		statbuf;
+	TimeLineID		tli;
+	XLogSegNo		segno;
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", segname);
+
+	fd = open(fullpath, O_RDONLY);
+	if (fd < 0)
+	{
+		pg_log_error("could not open xlog segment \"%s\": %m", fullpath);
+		return false;
+	}
+
+	if (fstat(fd, &statbuf) < 0)
+	{
+		pg_log_error("could not fstat xlog segment \"%s\": %m", fullpath);
+		goto close_on_error;
+	}
+
+	if (!S_ISREG(statbuf.st_mode))
+	{
+		pg_log_error("xlog segment \"%s\" is not a regular file", fullpath);
+		goto close_on_error;
+	}
+
+	if (statbuf.st_size != WalSegSz)
+	{
+		pg_log_error("invalid size of xlog segment \"%s\"; expected %d, actual %zd",
+					 fullpath, WalSegSz, (ssize_t) statbuf.st_size);
+		goto close_on_error;
+	}
+
+	XLogFromFileName(segname, &tli, &segno, WalSegSz);
+	off = ((size_t) segno * WalSegSz) % nvwal_size;
+
+	if (read(fd, &nvwal_pages[off], WalSegSz) < WalSegSz)
+	{
+		pg_log_error("could not fully read xlog segment \"%s\": %m", fullpath);
+		goto close_on_error;
+	}
+
+	if (close(fd) < 0)
+	{
+		pg_log_error("could not close xlog segment \"%s\": %m", fullpath);
+		return false;
+	}
+
+	return true;
+
+close_on_error:
+	(void) close(fd);
+	return false;
+}
+#endif
+
 static void
 BaseBackup(void)
 {
@@ -1845,7 +1985,8 @@ BaseBackup(void)
 	 * Build contents of configuration file if requested
 	 */
 	if (writerecoveryconf)
-		recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot);
+		recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot,
+													  nvwal_path);
 
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
@@ -2214,6 +2355,69 @@ BaseBackup(void)
 			exit(1);
 	}
 
+#ifdef USE_NVWAL
+	/* Copy xlog segments into NVWAL when nvwal mode */
+	if (format_nvwal)
+	{
+		char	xldr_path[MAXPGPATH];
+		int		nr_segs;
+		struct dirent **namelist;
+
+		/* clear NVWAL before copying xlog segments */
+		pmem_memset_persist(nvwal_pages, 0, nvwal_size);
+
+		snprintf(xldr_path, sizeof(xldr_path), "%s/%s",
+				 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
+
+		/*
+		 * Sort xlog segments in ascending order, filtering out non-segment
+		 * files and directories.
+		 */
+		nr_segs = scandir(xldr_path, &namelist, filter_walseg, compare_walseg);
+		if (nr_segs < 0)
+		{
+			pg_log_error("could not scan xlog directory \"%s\": %m", xldr_path);
+			exit(1);
+		}
+
+		/* Copy xlog segments onto NVWAL */
+		for (int i = 0; i < nr_segs; ++i)
+		{
+			if (!copy_walseg_onto_nvwal(namelist[i]->d_name))
+			{
+				free_namelist(namelist, nr_segs);
+				exit(1);
+			}
+		}
+
+		/* Copy complete; now remove all the xlog segments */
+		for (int i = 0; i < nr_segs; ++i)
+		{
+			char		fullpath[MAXPGPATH];
+
+			snprintf(fullpath, sizeof(fullpath), "%s/%s",
+					 xldr_path, namelist[i]->d_name);
+
+			if (unlink(fullpath) < 0)
+			{
+				pg_log_error("could not remove xlog segment \"%s\": %m", fullpath);
+				free_namelist(namelist, nr_segs);
+				exit(1);
+			}
+		}
+
+		free_namelist(namelist, nr_segs);
+
+		if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+		{
+			pg_log_error("could not unmap NVWAL: %m");
+			exit(1);
+		}
+		nvwal_pages = NULL;
+		nvwal_mapped_len = 0;
+	}
+#endif
+
 	if (verbose)
 		pg_log_info("base backup completed");
 }
@@ -2255,6 +2459,7 @@ main(int argc, char **argv)
 		{"no-manifest", no_argument, NULL, 5},
 		{"manifest-force-encode", no_argument, NULL, 6},
 		{"manifest-checksums", required_argument, NULL, 7},
+		{"nvwal-path", required_argument, NULL, 8},
 		{NULL, 0, NULL, 0}
 	};
 	int			c;
@@ -2295,9 +2500,27 @@ main(int argc, char **argv)
 				break;
 			case 'F':
 				if (strcmp(optarg, "p") == 0 || strcmp(optarg, "plain") == 0)
+				{
+					/* See the comment for "nvwal" below */
 					format = 'p';
+					format_nvwal = false;
+				}
 				else if (strcmp(optarg, "t") == 0 || strcmp(optarg, "tar") == 0)
+				{
+					/* See the comment for "nvwal" below */
 					format = 't';
+					format_nvwal = false;
+				}
+				else if (strcmp(optarg, "n") == 0 || strcmp(optarg, "nvwal") == 0)
+				{
+					/*
+					 * If "nvwal" mode given, we set two variables as follows
+					 * because it is almost same as "plain" mode, except NVWAL
+					 * handling.
+					 */
+					format = 'p';
+					format_nvwal = true;
+				}
 				else
 				{
 					pg_log_error("invalid output format \"%s\", must be \"plain\" or \"tar\"",
@@ -2352,6 +2575,9 @@ main(int argc, char **argv)
 			case 1:
 				xlog_dir = pg_strdup(optarg);
 				break;
+			case 8:
+				nvwal_path = pg_strdup(optarg);
+				break;
 			case 'l':
 				label = pg_strdup(optarg);
 				break;
@@ -2533,7 +2759,7 @@ main(int argc, char **argv)
 	{
 		if (format != 'p')
 		{
-			pg_log_error("WAL directory location can only be specified in plain mode");
+			pg_log_error("WAL directory location can only be specified in plain or nvwal mode");
 			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 					progname);
 			exit(1);
@@ -2550,6 +2776,44 @@ main(int argc, char **argv)
 		}
 	}
 
+#ifdef USE_NVWAL
+	if (format_nvwal)
+	{
+		if (nvwal_path == NULL)
+		{
+			pg_log_error("NVWAL file location must be given in nvwal mode");
+			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+					progname);
+			exit(1);
+		}
+
+		/* clean up NVWAL file name and check if it is absolute */
+		canonicalize_path(nvwal_path);
+		if (!is_absolute_path(nvwal_path))
+		{
+			pg_log_error("NVWAL file location must be an absolute path");
+			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+					progname);
+			exit(1);
+		}
+
+		/* We do not map NVWAL file here because we do not know its size yet */
+	}
+	else if (nvwal_path != NULL)
+	{
+		pg_log_error("NVWAL file location can only be specified in nvwal mode");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+#else
+	if (format_nvwal || nvwal_path != NULL)
+	{
+		pg_log_error("this build does not support nvwal mode");
+		exit(1);
+	}
+#endif /* USE_NVWAL */
+
 #ifndef HAVE_LIBZ
 	if (compresslevel != 0)
 	{
@@ -2594,6 +2858,9 @@ main(int argc, char **argv)
 	}
 	atexit(disconnect_atexit);
 
+	/* Remember the predicate for use after disconnection */
+	xlogdir_is_pg_xlog = (PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL);
+
 	/*
 	 * Set umask so that directories/files are created with the same
 	 * permissions as directories/files in the source data directory.
@@ -2620,6 +2887,16 @@ main(int argc, char **argv)
 	if (!RetrieveWalSegSize(conn))
 		exit(1);
 
+#ifdef USE_NVWAL
+	/* determine remote server's NVWAL size */
+	if (format_nvwal)
+	{
+		nvwal_size = RetrieveNvwalSize(conn);
+		if (nvwal_size == 0)
+			exit(1);
+	}
+#endif
+
 	/* Create pg_wal symlink, if required */
 	if (xlog_dir)
 	{
@@ -2632,8 +2909,7 @@ main(int argc, char **argv)
 		 * renamed to pg_wal in post-10 clusters.
 		 */
 		linkloc = psprintf("%s/%s", basedir,
-						   PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-						   "pg_xlog" : "pg_wal");
+						   xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 #ifdef HAVE_SYMLINK
 		if (symlink(xlog_dir, linkloc) != 0)
@@ -2648,6 +2924,41 @@ main(int argc, char **argv)
 		free(linkloc);
 	}
 
+#ifdef USE_NVWAL
+	/* Create and map NVWAL file if required */
+	if (format_nvwal)
+	{
+		int		is_pmem = 0;
+
+		nvwal_pages = pmem_map_file(nvwal_path, nvwal_size,
+									PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+									pg_file_create_mode,
+									&nvwal_mapped_len, &is_pmem);
+		if (nvwal_pages == NULL)
+		{
+			pg_log_error("could not map a new NVWAL file \"%s\": %m",
+						 nvwal_path);
+			exit(1);
+		}
+
+		made_new_nvwal = true;
+		atexit(cleanup_nvwal_atexit);
+
+		if (!is_pmem)
+		{
+			pg_log_error("NVWAL file \"%s\" is not on PMEM", nvwal_path);
+			exit(1);
+		}
+
+		if (nvwal_size != nvwal_mapped_len)
+		{
+			pg_log_error("invalid size of NVWAL file \"%s\"; expected %zu, actual %zu",
+						 nvwal_path, nvwal_size, nvwal_mapped_len);
+			exit(1);
+		}
+	}
+#endif
+
 	BaseBackup();
 
 	success = true;
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 410116492e..af2bb21e4c 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -397,6 +397,75 @@ RetrieveDataDirCreatePerm(PGconn *conn)
 	return true;
 }
 
+#ifdef USE_NVWAL
+/*
+ * Returns nvwal_size in bytes if available, 0 otherwise.
+ * Note that it is the caller's responsibility to check that the returned
+ * nvwal_size is really valid, that is, a multiple of the WAL segment size.
+ */
+size_t
+RetrieveNvwalSize(PGconn *conn)
+{
+	PGresult   *res;
+	char		unit[3];
+	int			val;
+	size_t		nvwal_size;
+
+	/* check connection existence */
+	Assert(conn != NULL);
+
+	/* fail if we do not have SHOW command */
+	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_SHOW_CMD)
+	{
+		pg_log_error("SHOW command is not supported for retrieving nvwal_size");
+		return 0;
+	}
+
+	res = PQexec(conn, "SHOW nvwal_size");
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("could not send replication command \"%s\": %s",
+					 "SHOW nvwal_size", PQerrorMessage(conn));
+
+		PQclear(res);
+		return 0;
+	}
+	if (PQntuples(res) != 1 || PQnfields(res) < 1)
+	{
+		pg_log_error("could not fetch NVWAL size: got %d rows and %d fields, expected %d rows and %d or more fields",
+					 PQntuples(res), PQnfields(res), 1, 1);
+
+		PQclear(res);
+		return 0;
+	}
+
+	/* fetch value and unit from the result */
+	if (sscanf(PQgetvalue(res, 0, 0), "%d%2s", &val, unit) != 2)
+	{
+		pg_log_error("NVWAL size could not be parsed");
+		PQclear(res);
+		return 0;
+	}
+
+	PQclear(res);
+
+	/* convert to bytes */
+	if (strcmp(unit, "MB") == 0)
+		nvwal_size = ((size_t) val) << 20;
+	else if (strcmp(unit, "GB") == 0)
+		nvwal_size = ((size_t) val) << 30;
+	else if (strcmp(unit, "TB") == 0)
+		nvwal_size = ((size_t) val) << 40;
+	else
+	{
+		pg_log_error("unsupported NVWAL unit");
+		return 0;
+	}
+
+	return nvwal_size;
+}
+#endif
+
 /*
  * Run IDENTIFY_SYSTEM through a given connection and give back to caller
  * some result information if requested:
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 57448656e3..b4c2ab1a74 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -41,6 +41,9 @@ extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  XLogRecPtr *startpos,
 							  char **db_name);
 extern bool RetrieveWalSegSize(PGconn *conn);
+#ifdef USE_NVWAL
+extern size_t RetrieveNvwalSize(PGconn *conn);
+#endif
 extern TimestampTz feGetCurrentTimestamp(void);
 extern void feTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 								  long *secs, int *microsecs);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 0015d3b461..578b37b588 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -360,7 +360,7 @@ main(int argc, char **argv)
 		pg_log_info("no rewind required");
 		if (writerecoveryconf && !dry_run)
 			WriteRecoveryConfig(conn, datadir_target,
-								GenerateRecoveryConfig(conn, NULL));
+								GenerateRecoveryConfig(conn, NULL, NULL));
 		exit(0);
 	}
 
@@ -460,7 +460,7 @@ main(int argc, char **argv)
 
 	if (writerecoveryconf && !dry_run)
 		WriteRecoveryConfig(conn, datadir_target,
-							GenerateRecoveryConfig(conn, NULL));
+							GenerateRecoveryConfig(conn, NULL, NULL));
 
 	pg_log_info("Done!");
 
diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c
index 46ca20e20b..1e08ec3fa8 100644
--- a/src/fe_utils/recovery_gen.c
+++ b/src/fe_utils/recovery_gen.c
@@ -20,7 +20,7 @@ static char *escape_quotes(const char *src);
  * return it.
  */
 PQExpBuffer
-GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
+GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot, char *nvwal_path)
 {
 	PQconninfoOption *connOptions;
 	PQExpBufferData conninfo_buf;
@@ -95,6 +95,13 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
 						  replication_slot);
 	}
 
+	if (nvwal_path)
+	{
+		escaped = escape_quotes(nvwal_path);
+		appendPQExpBuffer(contents, "nvwal_path = '%s'\n", escaped);
+		free(escaped);
+	}
+
 	if (PQExpBufferBroken(contents))
 	{
 		pg_log_error("out of memory");
diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h
index c8655cd294..061c59125b 100644
--- a/src/include/fe_utils/recovery_gen.h
+++ b/src/include/fe_utils/recovery_gen.h
@@ -21,7 +21,8 @@
 #define MINIMUM_VERSION_FOR_RECOVERY_GUC 120000
 
 extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn,
-										  char *pg_replication_slot);
+										  char *pg_replication_slot,
+										  char *nvwal_path);
 extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
 								PQExpBuffer contents);
 
-- 
2.17.1

v3-0005-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From 5a5408159af48096d0d9a1e002e49756078b526f Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:08:00 +0900
Subject: [PATCH v3 5/5] README for non-volatile WAL buffer

---
 README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 README.nvwal

diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By putting the WAL buffer pages on persistent memory (PMEM)
+[1], inserting WAL records into them directly, and eliminating I/O for WAL
+segment files, PostgreSQL gets lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+  * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+    * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+  * Linux: 4.15 or later (tested on 5.2)
+  * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+  * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+I believe it is good to install under your home directory with --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+  $ ./configure --with-nvwal --prefix="$HOME/postgres"
+  $ make
+  $ make install
+  $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget "-o dax" option
+on mount. For Intel DCPMM and ipmctl, please see [4].
+
+  $ ndctl list
+  [
+    {
+      "dev":"namespace1.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem1",
+      "numa_node":1
+    },
+    {
+      "dev":"namespace0.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem0",
+      "numa_node":0
+    }
+  ]
+
+  $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+  {
+    "dev":"namespace0.0",
+    "mode":"fsdax",
+    "map":"dev",
+    "size":"94.50 GiB (101.47 GB)",
+    "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+    "sector_size":512,
+    "blockdev":"pmem0",
+    "numa_node":0
+  }
+
+  $ ls -l /dev/pmem0
+  brw-rw---- 1 root disk 259, 3 Jan  6 17:06 /dev/pmem0
+
+  $ sudo mkfs.ext4 -q -F /dev/pmem0
+  $ sudo mkdir -p /mnt/pmem0
+  $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+  $ mount -l | grep ^/dev/pmem0
+  /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge pages
+-----------------------------
+Transparent huge pages are generally not suitable for database workloads,
+but they improve PMEM performance by reducing the overhead of page walks.
+
+  $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+  -rw-r--r-- 1 root root 4096 Dec  3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+  $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+  $ cat /sys/kernel/mm/transparent_hugepage/enabled
+  [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+  -P, --nvwal-path=FILE  path to file for non-volatile WAL buffer (NVWAL)
+  -Q, --nvwal-size=SIZE  size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+  $ sudo mkdir -p /mnt/pmem0/pgsql
+  $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+  $ export PGDATA="$HOME/pgdata"
+  $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is okay; your NVWAL file has the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+  segment size. The segment size is given with initdb --wal-segsize, and
+  defaults to 16MB.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+  which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+  above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+  exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+  not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please decide carefully how
+  large your NVWAL file should be.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters, nvwal_path and nvwal_size, corresponding to the two
+new options of initdb. If you run initdb as above, you will find settings like
+the following in postgresql.conf in your PGDATA directory:
+
+  max_wal_size = 80GB
+  min_wal_size = 80GB
+  nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+  nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+  actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+  forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres might run even if the three values are not the
+  same, but we have not tested such a case yet.
+
+
+Startup
+-------
+As usual:
+
+  $ pg_ctl start
+
+or use numactl as follows to let postgres run on the specified NUMA node
+(typically the one on which your NVWAL file is) if you need stable performance:
+
+  $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
-- 
2.17.1

#20Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#19)
5 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Rebased.

On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3. It can now be used
in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
  walreceiver now stores received records directly in the non-volatile WAL
  buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
  pg_basebackup now copies received WAL segments onto the non-volatile WAL
  buffer if you run it in "nvwal" mode (-Fn).
  You should specify a new NVWAL path with the --nvwal-path option. The path
  will be written to postgresql.auto.conf or recovery.conf. The size of the
  new NVWAL is the same as the master's.
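
To make the size handling concrete: in the patch, RetrieveNvwalSize() runs
"SHOW nvwal_size" on the master and converts a value such as "80GB" into
bytes. The following is only a standalone sketch of that conversion, not the
patch code itself. The function name parse_nvwal_size is hypothetical, and the
sketch assumes the server reports the value with an MB, GB, or TB unit. It
bounds the unit field and rejects trailing characters and overflow.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patchset): convert a SHOW-style
 * value such as "80GB" into bytes.  Returns false if the text cannot be
 * parsed, the unit is not MB/GB/TB, or the result would not fit in size_t.
 */
bool
parse_nvwal_size(const char *text, size_t *bytes)
{
	int			val;
	char		unit[3];
	int			consumed = 0;
	unsigned	shift;

	assert(text != NULL && bytes != NULL);

	/* "%2s" stores at most two characters plus the terminating NUL */
	if (sscanf(text, "%d%2s%n", &val, unit, &consumed) != 2)
		return false;

	/* reject non-positive values and trailing characters such as "80GBx" */
	if (val <= 0 || text[consumed] != '\0')
		return false;

	if (strcmp(unit, "MB") == 0)
		shift = 20;
	else if (strcmp(unit, "GB") == 0)
		shift = 30;
	else if (strcmp(unit, "TB") == 0)
		shift = 40;
	else
		return false;

	/* guard against overflow, including platforms with a 32-bit size_t */
	if (shift >= sizeof(size_t) * CHAR_BIT ||
		(size_t) val > (SIZE_MAX >> shift))
		return false;

	*bytes = (size_t) val << shift;
	return true;
}
```

A caller would still have to check that the result is a multiple of the WAL
segment size, as the comment on RetrieveNvwalSize() notes.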

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote' <amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer patchset onto master. A new v2 patchset
is attached to this mail.

I also measured performance before and after the patchset, varying the
-c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50
or 1000. The results are presented in the following tables and the attached
charts.

Conditions, steps, and other details are given below.

Results (s=50)
==============
         Throughput [10^3 TPS]    Average latency [ms]
( c, j)  before  after            before  after
-------  ---------------------    ---------------------
( 8, 8)  35.7    37.1 (+3.9%)     0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)     0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)     0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)     0.715   0.660 (-7.7%)

Results (s=1000)
================
         Throughput [10^3 TPS]    Average latency [ms]
( c, j)  before  after            before  after
-------  ---------------------    ---------------------
( 8, 8)  37.4    40.1 (+7.3%)     0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)     0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)     0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)     0.622   0.569 (-8.5%)
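
The two metrics in the tables above are closely related: the average latency
is approximately the number of clients c divided by the throughput. The sketch
below reproduces that relationship so the columns can be cross-checked. The
helper name latency_ms_from_ktps is hypothetical and used only for
illustration.

```c
#include <assert.h>

/*
 * Hypothetical helper: expected average latency in milliseconds for
 * "clients" concurrent clients at "ktps" thousand transactions per second.
 */
double
latency_ms_from_ktps(int clients, double ktps)
{
	assert(clients > 0 && ktps > 0.0);

	/* clients / (ktps * 1000 TPS) seconds equals clients / ktps milliseconds */
	return (double) clients / ktps;
}
```

For example, c = 8 at 35.7 x 10^3 TPS gives about 0.224 ms, which agrees with
the first "before" row for s=50.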

Both throughput and average latency improved for each scaling factor.
Throughput seemed to almost reach its upper limit at (c,j)=(36,18).

The percentage gain in the s=1000 case looks larger than in the s=50 case. I
think a larger scaling factor leads to less contention on the same tables
and/or indexes, that is, fewer lock and unlock operations. In such a
situation, write-ahead logging appears to be more significant for performance.
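
The percentages in parentheses in the tables are relative changes from
"before" to "after". A minimal sketch of that calculation follows; the helper
name percent_change is hypothetical. Note that the tables show rounded values,
so recomputing from them can differ from a printed percentage by about a tenth
of a point (presumably the printed ones come from unrounded measurements,
which is an assumption here).

```c
#include <assert.h>

/* Hypothetical helper: relative change from "before" to "after", in percent. */
double
percent_change(double before, double after)
{
	assert(before > 0.0);

	return (after - before) / before * 100.0;
}
```

With the first s=50 row, throughput going from 35.7 to 37.1 is a change of
about +3.9%, and latency going from 0.224 to 0.216 is about -3.6%.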

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
  - Pin postgres (server processes) to node 0 and pgbench to node 1
  - 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
  - Both are installed on the server-side node, that is, node 0
  - Both are formatted with ext4
  - NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
  - Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times and took the
median of the three runs as the final result shown in the tables above.

(1) Run initdb with proper -D and -X options; after the patch, also give
    --nvwal-path and --nvwal-size options
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes
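
Taking the median of the three runs can be sketched as below. This is only an
illustration of the reduction step, not the script actually used for the
measurements; the helper name median_of_three is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: median of three measurements. */
double
median_of_three(double a, double b, double c)
{
	/* a NaN would make the comparisons below meaningless */
	assert(a == a && b == b && c == c);

	if ((a <= b && b <= c) || (c <= b && b <= a))
		return b;				/* b lies between a and c */
	if ((b <= a && a <= c) || (c <= a && a <= b))
		return a;				/* a lies between b and c */
	return c;
}
```

Using the median rather than the mean keeps a single unusually fast or slow
run from shifting the reported value.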

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option, so the default built-in "TPC-B (sort-of)" script was used.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it's so to speak "do as the hackers do
when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset and
performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> NTT Software
Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas
<hlinnaka@iki.fi>; PostgreSQL-development
<pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches themselves, but let me
point out that it's better to base these patches on HEAD (master branch) than
REL_12_0, because all new code is committed to the master branch, whereas
stable branches such as REL_12_0 only receive bug fixes. Do you have any
specific reason to be working on REL_12_0?

Yes, because I think it's human-friendly to reproduce and discuss performance
measurement. Of course I know all new accepted patches are merged into
master's HEAD, not stable branches and not even release tags, so I'm aware
that I will have to rebase my patchset onto master sooner or later. However,
if someone, including me, says that s/he applies my patchset to "master" and
measures its performance, we have to pay attention to which commit the
"master" really points to. Although we have sha1 hashes to specify which
commit, we should check whether the specific commit on master has patches
affecting performance or not, because master's HEAD gets new patches day by
day. On the other hand, a release tag clearly points to a commit we all
probably know. Also, we can check the features and improvements more easily
by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest stable
release's branch, that's normally just one of the baselines. The more
important baseline for ongoing development is the master branch's HEAD, which
is also what people volunteering to test your patches would use. Anyone who
reports would have to give at least two numbers -- performance with a
branch's HEAD without patch applied and that with patch applied -- which can
be enough in most cases to see the difference the patch makes. Sure, the
numbers might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to notice the
impact of any relevant developments in the master branch, even developments
which possibly require rethinking the architecture of your own changes,
although maybe that rarely occurs.

Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

v4-0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From 668939ff8ddca517c7efb08218b01007ee6b4e94 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:56 +0900
Subject: [PATCH v4 1/5] Support GUCs for external WAL buffer

To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size.  Now postgres maps a file at that path onto memory to
use it as WAL buffer.  Note that the buffer is still volatile for now.
---
 configure                                     | 262 ++++++++++++++++++
 configure.ac                                  |  43 +++
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/nv_xlog_buffer.c   |  95 +++++++
 src/backend/access/transam/xlog.c             | 164 ++++++++++-
 src/backend/utils/misc/guc.c                  |  23 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/bin/initdb/initdb.c                       |  93 ++++++-
 src/include/access/nv_xlog_buffer.h           |  71 +++++
 src/include/access/xlog.h                     |   2 +
 src/include/pg_config.h.in                    |   6 +
 src/include/utils/guc.h                       |   4 +
 12 files changed, 747 insertions(+), 21 deletions(-)
 create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
 create mode 100644 src/include/access/nv_xlog_buffer.h

diff --git a/configure b/configure
index 19a3cd09a0..764ed1e942 100755
--- a/configure
+++ b/configure
@@ -867,6 +867,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 '
@@ -1571,6 +1572,7 @@ Optional Packages:
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
+  --with-nvwal            use non-volatile WAL buffer (NVWAL)
   --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
 
 Some influential environment variables:
@@ -8601,6 +8603,203 @@ fi
 
 
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -z "$GREP"; then
+  ac_path_GREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in grep ggrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+  # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'GREP' >> "conftest.nl"
+    "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_GREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_GREP="$ac_path_GREP"
+      ac_path_GREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_GREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_GREP"; then
+    as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+   then ac_cv_path_EGREP="$GREP -E"
+   else
+     if test -z "$EGREP"; then
+  ac_path_EGREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in egrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+  # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'EGREP' >> "conftest.nl"
+    "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_EGREP="$ac_path_EGREP"
+      ac_path_EGREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_EGREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_EGREP"; then
+    as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_EGREP=$EGREP
+fi
+
+   fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#if __ELF__
+  yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "yes" >/dev/null 2>&1; then :
+  ELF_SYS=true
+else
+  if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
 #
 # Assignments
 #
@@ -12962,6 +13161,57 @@ fi
 fi
 
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
 
 ##
 ## Header files
@@ -13641,6 +13891,18 @@ fi
 
 done
 
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$PORTNAME" = "win32" ; then
diff --git a/configure.ac b/configure.ac
index 6b9d0487a8..afa501a665 100644
--- a/configure.ac
+++ b/configure.ac
@@ -999,6 +999,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
               [do not use Zlib])
 AC_SUBST(with_zlib)
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+              [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+  yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
 #
 # Assignments
 #
@@ -1303,6 +1335,12 @@ elif test "$with_uuid" = ossp ; then
 fi
 AC_SUBST(UUID_LIBS)
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [],
+               [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 
 ##
 ## Header files
@@ -1480,6 +1518,11 @@ elif test "$with_uuid" = ossp ; then
       [AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
 fi
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 if test "$PORTNAME" = "win32" ; then
    AC_CHECK_HEADERS(crtdefs.h)
 fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
 	xlogfuncs.o \
 	xloginsert.o \
 	xlogreader.o \
-	xlogutils.o
+	xlogutils.o \
+	nv_xlog_buffer.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ *		PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns the mapped address on success; otherwise PANICs and never returns.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
+
+	Assert(fname != NULL);
+	Assert(fsize > 0);
+
+	if (IsBootstrapProcessingMode())
+	{
+		/*
+		 * Create and map a new file if we are in bootstrap mode (typically
+		 * executed by initdb).
+		 */
+		addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+							 pg_file_create_mode, &map_len, &is_pmem);
+	}
+	else
+	{
+		/*
+		 * Map an existing file.  The second argument (len) should be zero,
+		 * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+		 * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+		 */
+		addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+	}
+
+	if (addr == NULL)
+		elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+	if (map_len != fsize)
+		elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+					"expected %zu; actual %zu",
+			 fname, fsize, map_len);
+
+	if (!is_pmem)
+		elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+			 fname);
+
+	/*
+	 * Assert page boundary alignment (8KiB by default).  It should pass because
+	 * PMDK tries to align the mapping to a hugepage boundary (2MiB or 1GiB on x64).
+	 */
+	Assert((uintptr_t) addr % XLOG_BLCKSZ == 0);
+
+	elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+		 fname, addr, (char *) addr + map_len);
+	return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	Assert(addr != NULL);
+
+	if (pmem_unmap(addr, fsize) < 0)
+	{
+		elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+		return;
+	}
+
+	elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 09c01ed4ae..a7bb7c88ff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -37,6 +37,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -873,6 +874,12 @@ static bool InRedo = false;
 /* Have we launched bgwriter during recovery? */
 static bool bgwriterLaunched = false;
 
+/* For non-volatile WAL buffer (NVWAL) */
+char	   *NvwalPath = NULL;	/* a GUC parameter */
+int			NvwalSizeMB = 1024;	/* a direct GUC parameter */
+static Size	NvwalSize = 0;		/* an indirect GUC parameter */
+static bool	NvwalAvail = false;
+
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -5014,6 +5021,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+	Assert(!NvwalAvail);
+
+	if (**newval != '\0')
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path cannot be set in a build without NVWAL support");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+	/* true if not empty; false if empty */
+	NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the range only and DOES NOT check whether the size is a multiple
+ * of wal_segment_size, because the segment size (probably stored in the
+ * control file) has not been set properly at this point.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+	Size		buf_size;
+	int64		npages;
+
+	Assert(*newval > 0);
+
+	buf_size = (Size) (*newval) * 1024 * 1024;
+	npages = (int64) buf_size / XLOG_BLCKSZ;
+	Assert(npages > 0);
+
+	if (npages > INT_MAX)
+	{
+		/* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages too large; "
+						 "buf_size %zu; XLOG_BLCKSZ %d",
+						 *newval, buf_size, (int) XLOG_BLCKSZ);
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+	NvwalSize = (Size) newval * 1024 * 1024;
+}
+
 /*
  * Read the control file, set respective GUCs.
  *
@@ -5042,13 +5119,49 @@ XLOGShmemSize(void)
 {
 	Size		size;
 
+	/*
+	 * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+	 * Instead, we set it to a value derived from the size of the file for the
+	 * buffer. This should be done here because of xlblocks array calculation.
+	 */
+	if (NvwalAvail)
+	{
+		char		buf[32];
+		int64		npages;
+
+		Assert(NvwalSizeMB > 0);
+		Assert(NvwalSize > 0);
+		Assert(wal_segment_size > 0);
+		Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+		/*
+		 * At last, we can check if the size of non-volatile WAL buffer
+		 * (nvwal_size) is multiple of WAL segment size.
+		 *
+		 * Note that NvwalSize has already been calculated in assign_nvwal_size.
+		 */
+		if (NvwalSize % wal_segment_size != 0)
+		{
+			elog(PANIC,
+				 "invalid value for nvwal_size (%dMB): "
+				 "it should be multiple of WAL segment size; "
+				 "NvwalSize %zu; wal_segment_size %d",
+				 NvwalSizeMB, NvwalSize, wal_segment_size);
+		}
+
+		npages = (int64) NvwalSize / XLOG_BLCKSZ;
+		Assert(npages > 0 && npages <= INT_MAX);
+
+		snprintf(buf, sizeof(buf), "%d", (int) npages);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+	}
 	/*
 	 * If the value of wal_buffers is -1, use the preferred auto-tune value.
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
 	 */
-	if (XLOGbuffers == -1)
+	else if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
@@ -5064,10 +5177,13 @@ XLOGShmemSize(void)
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	if (!NvwalAvail)
+	{
+		/* extra alignment padding for XLOG I/O buffers */
+		size = add_size(size, XLOG_BLCKSZ);
+		/* and the buffers themselves */
+		size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	}
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5161,13 +5277,32 @@ XLOGShmemInit(void)
 	}
 
 	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * Open and memory-map a file for non-volatile XLOG buffer. The PMDK will
+	 * align the start of the buffer to 2-MiB boundary if the size of the
+	 * buffer is larger than or equal to 4 MiB.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	if (NvwalAvail)
+	{
+		/* Logging and error-handling should be done in the function */
+		XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+		/*
+		 * Do not memset the non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it may contain records needed for recovery. We should do
+		 * so in a checkpoint after recovery completes successfully.
+		 */
+	}
+	else
+	{
+		/*
+		 * Align the start of the page buffers to a full xlog block size
+		 * boundary. This simplifies some calculations in XLOG insertion. It
+		 * is also required for O_DIRECT.
+		 */
+		allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+		XLogCtl->pages = allocptr;
+		memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	}
 
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8523,6 +8658,13 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+
+	/*
+	 * If we use non-volatile XLOG buffer, unmap it.
+	 */
+	if (NvwalAvail)
+		UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
+
 	ShutdownCLOG();
 	ShutdownCommitTs();
 	ShutdownSUBTRANS();
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index de87ad6ef7..77a1b8bb32 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2714,7 +2714,7 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-		-1, -1, (INT_MAX / XLOG_BLCKSZ),
+		-1, -1, INT_MAX,
 		check_wal_buffers, NULL, NULL
 	},
 
@@ -3399,6 +3399,17 @@ static struct config_int ConfigureNamesInt[] =
 		check_huge_page_size, NULL, NULL
 	},
 
+	{
+		{"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+			NULL,
+			GUC_UNIT_MB
+		},
+		&NvwalSizeMB,
+		1024, 1, INT_MAX,
+		check_nvwal_size, assign_nvwal_size, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4448,6 +4459,16 @@ static struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+			NULL
+		},
+		&NvwalPath,
+		"",
+		check_nvwal_path, assign_nvwal_path, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 9cb571f7cc..f343d6b296 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,8 @@
 #checkpoint_timeout = 5min		# range 30s-1d
 #max_wal_size = 1GB
 #min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 37e0d7ceab..2dd0a09734 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -145,7 +145,10 @@ static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
 static int	wal_segment_size_mb;
+static int	nvwal_size_mb;
 
 
 /* internal vars */
@@ -1098,14 +1101,78 @@ setup_config(void)
 	conflines = replace_token(conflines, "#port = 5432", repltok);
 #endif
 
-	/* set default max_wal_size and min_wal_size */
-	snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
-	conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+	if (nvwal_path != NULL)
+	{
+		int nr_segs;
+
+		if (str_nvwal_size_mb == NULL)
+			nvwal_size_mb = 1024;
+		else
+		{
+			char *endptr;
+
+			/* check that the argument is a number */
+			nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+			/* verify that the size of non-volatile WAL buffer is valid */
+			if (endptr == str_nvwal_size_mb || *endptr != '\0')
+			{
+				pg_log_error("argument of --nvwal-size must be a number; "
+							 "str_nvwal_size_mb '%s'",
+							 str_nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb <= 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a positive number; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb % wal_segment_size_mb != 0)
+			{
+				pg_log_error("argument of --nvwal-size must be multiple of WAL segment size; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+				exit(1);
+			}
+		}
+
+		/*
+		 * XXX We set {min_,max_,nv}wal_size to the same value.  Note that
+		 * postgres might bootstrap and run even if the three settings do not
+		 * have the same value, but that has not been tested yet.
+		 */
+		nr_segs = nvwal_size_mb / wal_segment_size_mb;
 
-	snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
-	conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+				 nvwal_path);
+		conflines = replace_token(conflines,
+								  "#nvwal_path = '/path/to/nvwal'", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+	}
+	else
+	{
+		/* set default max_wal_size and min_wal_size */
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	}
 
 	snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
 			 escape_quotes(lc_messages));
@@ -2310,6 +2377,8 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("  -P, --nvwal-path=FILE     path to file for non-volatile WAL buffer (NVWAL)\n"));
+	printf(_("  -Q, --nvwal-size=SIZE     size of NVWAL, in megabytes\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("  -k, --data-checksums      use data page checksums\n"));
@@ -2978,6 +3047,8 @@ main(int argc, char *argv[])
 		{"sync-only", no_argument, NULL, 'S'},
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
+		{"nvwal-path", required_argument, NULL, 'P'},
+		{"nvwal-size", required_argument, NULL, 'Q'},
 		{"data-checksums", no_argument, NULL, 'k'},
 		{"allow-group-access", no_argument, NULL, 'g'},
 		{NULL, 0, NULL, 0}
@@ -3021,7 +3092,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:g", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "dD:E:kL:nNU:WA:sST:X:P:Q:g", long_options, &option_index)) != -1)
 	{
 		switch (c)
 		{
@@ -3115,6 +3186,12 @@ main(int argc, char *argv[])
 			case 12:
 				str_wal_segment_size_mb = pg_strdup(optarg);
 				break;
+			case 'P':
+				nvwal_path = pg_strdup(optarg);
+				break;
+			case 'Q':
+				str_nvwal_size_mb = pg_strdup(optarg);
+				break;
 			case 'g':
 				SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
 				break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void	UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#define nv_memset_persist	pmem_memset_persist
+#define nv_memcpy_nodrain	pmem_memcpy_nodrain
+#define nv_flush			pmem_flush
+#define nv_drain			pmem_drain
+#define nv_persist			pmem_persist
+
+#else
+static inline void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+static inline void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	return;
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+	return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+				  size_t len)
+{
+	return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+	return;
+}
+
+static inline void
+nv_drain(void)
+{
+	return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+	return;
+}
+
+#endif							/* USE_NVWAL */
+#endif							/* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..03fd1267e8 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,8 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern char *NvwalPath;
+extern int  NvwalSizeMB;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index fb270df678..961be9aff5 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
 /* Define to 1 if you have the `pam' library (-lpam). */
 #undef HAVE_LIBPAM
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define if you have a function readline library */
 #undef HAVE_LIBREADLINE
 
@@ -884,6 +887,9 @@
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
 /* Define to build with OpenSSL support. (--with-openssl) */
 #undef USE_OPENSSL
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..d941a76d43 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,6 +438,10 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.17.1
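
The size checks that patch 0001 applies to nvwal_size (the page-count bound in check_nvwal_size and the segment-multiple test in XLOGShmemSize) can be sketched as one standalone function. This is only an illustration, not code from the patch: the name sketch_nvwal_size_ok and the macro SKETCH_XLOG_BLCKSZ are hypothetical, and the block size is assumed to be the 8 KiB default. In the patch itself the two checks run at different times, and the second one PANICs instead of returning false.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XLOG_BLCKSZ, assuming the 8 KiB default. */
#define SKETCH_XLOG_BLCKSZ 8192

/*
 * Returns true if nvwal_size_mb (in megabytes) passes both checks:
 * the number of WAL pages fits in an int, and the size in bytes is a
 * multiple of wal_segment_size (in bytes).  On success, the page count
 * that would become wal_buffers is stored in *npages_out.
 */
static bool
sketch_nvwal_size_ok(int nvwal_size_mb, int wal_segment_size, int *npages_out)
{
	uint64_t	buf_size;
	uint64_t	npages;

	if (nvwal_size_mb <= 0 || wal_segment_size <= 0)
		return false;

	buf_size = (uint64_t) nvwal_size_mb * 1024 * 1024;
	npages = buf_size / SKETCH_XLOG_BLCKSZ;

	/* Same bound as check_nvwal_size: wal_buffers is an int GUC. */
	if (npages > INT_MAX)
		return false;

	/* Same test as XLOGShmemSize: whole segments only. */
	if (buf_size % (uint64_t) wal_segment_size != 0)
		return false;

	*npages_out = (int) npages;
	return true;
}
```

With the default nvwal_size of 1024 MB and 16 MB segments, the sketch accepts the value and reports 131072 pages; a value such as 24 MB is rejected because it is not a whole number of segments.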

Attachment: v4-0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 9d2ebe6744b9fdb966da78d4a535bc5c4fee33e0 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:57 +0900
Subject: [PATCH v4 2/5] Non-volatile WAL buffer

Now external WAL buffer becomes non-volatile.

Bumps PG_CONTROL_VERSION.
---
 src/backend/access/transam/xlog.c            | 1154 ++++++++++++++++--
 src/backend/access/transam/xlogreader.c      |   24 +
 src/bin/pg_controldata/pg_controldata.c      |    3 +
 src/include/access/xlog.h                    |    8 +
 src/include/catalog/pg_control.h             |   17 +-
 src/test/regress/expected/misc_functions.out |   14 +-
 src/test/regress/sql/misc_functions.sql      |   14 +-
 7 files changed, 1097 insertions(+), 137 deletions(-)
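
The NVWAL branch that this patch adds to XLogFlush treats the buffer as a ring: an LSN maps to the offset LSN % NvwalSize, and a flush request that crosses the end of the buffer is split in two. The sketch below only computes the ranges that would be handed to nv_flush; it is an illustration under stated assumptions, not the patch's code. The names SketchFlushRange and sketch_flush_ranges are hypothetical, and the caller is assumed to guarantee from < upto, as the early return in the patch does.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct SketchFlushRange
{
	size_t		off;			/* byte offset from the start of the buffer */
	size_t		len;			/* number of bytes to flush */
} SketchFlushRange;

/*
 * Computes the buffer ranges covering LSNs [from, upto) in a ring buffer
 * of buf_size bytes.  Returns the number of ranges written to out (1 or 2).
 * As in the patch, the second range may have length zero when upto falls
 * exactly on the start of the buffer.
 */
static int
sketch_flush_ranges(uint64_t from, uint64_t upto, size_t buf_size,
					SketchFlushRange out[2])
{
	size_t		fromoff;
	size_t		uptooff;
	int			n = 0;

	/* More than one full lap: flush the whole buffer. */
	if (upto - from > buf_size)
	{
		out[0].off = 0;
		out[0].len = buf_size;
		return 1;
	}

	fromoff = (size_t) (from % buf_size);
	uptooff = (size_t) (upto % buf_size);

	/* Handle rotation: first flush from fromoff to the end of the buffer. */
	if (uptooff <= fromoff)
	{
		out[n].off = fromoff;
		out[n].len = buf_size - fromoff;
		n++;
		fromoff = 0;
	}

	out[n].off = fromoff;
	out[n].len = uptooff - fromoff;
	n++;

	return n;
}
```

For a 1024-byte ring, flushing LSNs 900 to 1100 yields the two ranges (900, 124) and (0, 76). In the patch, each such range is passed to nv_flush and a single nv_drain follows as the store barrier.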

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a7bb7c88ff..6a579a308f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -652,6 +652,13 @@ typedef struct XLogCtlData
 	TimeLineID	ThisTimeLineID;
 	TimeLineID	PrevTimeLineID;
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * All the records up to this LSN are persistent in NVWAL.
+	 */
+	XLogRecPtr	persistentUpTo;
+
 	/*
 	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
@@ -783,11 +790,13 @@ typedef enum
 	XLOG_FROM_ANY = 0,			/* request to read WAL from any source */
 	XLOG_FROM_ARCHIVE,			/* restored using restore_command */
 	XLOG_FROM_PG_WAL,			/* existing file in pg_wal */
-	XLOG_FROM_STREAM			/* streamed from primary */
+	XLOG_FROM_NVWAL,			/* non-volatile WAL buffer */
+	XLOG_FROM_STREAM,			/* streamed from primary via segment file */
+	XLOG_FROM_STREAM_NVWAL		/* same as above, but via NVWAL */
 } XLogSource;
 
 /* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream", "stream_nvwal"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
@@ -922,6 +931,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1204,6 +1214,43 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/*
+	 * Request a checkpoint here if non-volatile WAL buffer is used and we
+	 * have consumed too much WAL since the last checkpoint.
+	 *
+	 * We first screen under the condition (1) OR (2) below:
+	 *
+	 * (1) The record was the first one in a certain segment.
+	 * (2) The record was inserted across segments.
+	 *
+	 * We then check the segment number which the record was inserted into.
+	 */
+	if (NvwalAvail && inserted &&
+		(StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+		 StartPos / wal_segment_size < EndPos / wal_segment_size))
+	{
+		XLogSegNo	end_segno;
+
+		XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+		/*
+		 * NOTE: We do not signal walsenders here because the inserted record
+		 * has not been drained to the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal the archiver here because the inserted record
+		 * has not been flushed to a segment file.  So we don't need to update
+		 * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+		 */
+
+		/* Two-step checking for speed (see also XLogWrite) */
+		if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+		{
+			(void) GetRedoRecPtr();
+			if (XLogCheckpointNeeded(end_segno))
+				RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+		}
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -2136,6 +2183,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
 	int			npages = 0;
+	bool		is_firstpage;
+
+	if (NvwalAvail)
+		elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo,
+			 (uint32) (upto >> 32),
+			 (uint32) upto,
+			 opportunistic ? "true" : "false");
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -2197,7 +2253,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				{
 					/* Have to write it ourselves */
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
+
+					if (NvwalAvail)
+					{
+						/*
+						 * If we use the non-volatile WAL buffer, writing the
+						 * buffer pages out to segment files is a special but
+						 * expected case, and for simplicity, it is done
+						 * segment by segment.
+						 */
+						XLogRecPtr		OldSegEndPtr;
+
+						OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+						Assert(OldSegEndPtr % wal_segment_size == 0);
+
+						WriteRqst.Write = OldSegEndPtr;
+					}
+					else
+						WriteRqst.Write = OldPageRqstPtr;
+
 					WriteRqst.Flush = 0;
 					XLogWrite(WriteRqst, false);
 					LWLockRelease(WALWriteLock);
@@ -2224,7 +2298,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
 		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+		if (NvwalAvail)
+		{
+			/*
+			 * We do not combine MemSet() with pmem_persist() because
+			 * pmem_persist() may use a slow, strongly ordered cache flush
+			 * instruction if a fast, weakly ordered one is not supported.
+			 * Instead, we first zero the buffer with pmem_memset_persist(),
+			 * which can leverage fast non-temporal store instructions, then
+			 * make the header persistent later.
+			 */
+			nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+		}
+		else
+			MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
 
 		/*
 		 * Fill the new page's header
@@ -2256,7 +2343,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+		is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+		if (is_firstpage)
 		{
 			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
 
@@ -2271,7 +2359,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
 		 * holding a lock.
 		 */
-		pg_write_barrier();
+		if (NvwalAvail)
+		{
+			/* Make the header persistent on PMEM */
+			nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+		}
+		else
+			pg_write_barrier();
 
 		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
 
@@ -2281,6 +2375,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	}
 	LWLockRelease(WALBufMappingLock);
 
+	if (NvwalAvail)
+		elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+			 (uint32) (ControlFile->discardedUpTo >> 32),
+			 (uint32) ControlFile->discardedUpTo,
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo);
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
 	{
@@ -2662,6 +2763,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
 
+	/*
+	 * Update discardedUpTo if NVWAL is used.  A new value should not fall
+	 * behind the old one.
+	 */
+	if (NvwalAvail)
+	{
+		Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		if (ControlFile->discardedUpTo < LogwrtResult.Write)
+		{
+			ControlFile->discardedUpTo = LogwrtResult.Write;
+			UpdateControlFile();
+		}
+		LWLockRelease(ControlFileLock);
+	}
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -2866,6 +2984,123 @@ XLogFlush(XLogRecPtr record)
 		return;
 	}
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	FromPos;
+
+		/*
+		 * No page on the NVWAL is to be flushed to segment files.  Instead,
+		 * we wait for all the insertions preceding this one to complete.  We
+		 * then make all those records persistent on the NVWAL below.
+		 */
+		record = WaitXLogInsertionsToFinish(record);
+
+		/*
+		 * Check if another backend has already done what I am doing.
+		 *
+		 * We can compare something <= XLogCtl->persistentUpTo without
+		 * holding XLogCtl->info_lck spinlock because persistentUpTo is
+		 * monotonically increasing and can be loaded atomically on each
+		 * NVWAL-supported platform (now x64 only).
+		 */
+		FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+		if (record <= FromPos)
+			return;
+
+		/*
+		 * In a very rare case, we have wrapped around the whole NVWAL.  We do
+		 * not need to care about old pages here because they have already
+		 * been evicted to segment files at record insertion.
+		 *
+		 * In such a case, we flush the whole NVWAL.  We also log it as a
+		 * warning because it can be a time-consuming operation.
+		 *
+		 * TODO Advance XLogCtl->persistentUpTo at the end of XLogWrite, and
+		 * we can remove the following first if-block.
+		 */
+		if (record - FromPos > NvwalSize)
+		{
+			elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+				 (uint32) (FromPos >> 32), (uint32) FromPos,
+				 (uint32) (record >> 32), (uint32) record);
+
+			nv_flush(XLogCtl->pages, NvwalSize);
+		}
+		else
+		{
+			char   *frompos;
+			char   *uptopos;
+			size_t	fromoff;
+			size_t	uptooff;
+
+			/*
+			 * Flush each record that is probably not flushed yet.
+			 *
+			 * We say "probably" for two reasons.  The first is that a record
+			 * copied with non-temporal store instructions has already been
+			 * "flushed", but we cannot tell such a record apart.  Calling
+			 * nv_flush on it does not harm consistency.
+			 *
+			 * The second is that the target record might already have been
+			 * evicted to a segment file by now.  In this case too, nv_flush
+			 * does not harm consistency.
+			 */
+			uptooff = record % NvwalSize;
+			uptopos = XLogCtl->pages + uptooff;
+			fromoff = FromPos % NvwalSize;
+			frompos = XLogCtl->pages + fromoff;
+
+			/* Handles rotation */
+			if (uptopos <= frompos)
+			{
+				nv_flush(frompos, NvwalSize - fromoff);
+				fromoff = 0;
+				frompos = XLogCtl->pages;
+			}
+
+			nv_flush(frompos, uptooff - fromoff);
+		}
+
+		/*
+		 * To guarantee durability ("D" of ACID), we should satisfy the
+		 * following two for each transaction X:
+		 *
+		 *  (1) All the WAL records inserted by X, including the commit record
+		 *      of X, should persist on NVWAL before the server commits X.
+		 *
+		 *  (2) All the WAL records inserted by transactions other than X
+		 *      that have a lower LSN than the commit record just inserted
+		 *      by X should persist on NVWAL before the server commits X.
+		 *
+		 * (1) can be satisfied by a store barrier after the commit record
+		 * of X is flushed, because each WAL record of X is already flushed
+		 * at the end of its insertion.  (2) can be satisfied by waiting for
+		 * any record insertions that have a lower LSN than the commit record
+		 * just inserted by X, and by a store barrier as well.
+		 *
+		 * Now is the time.  Have a store barrier.
+		 */
+		nv_drain();
+
+		/*
+		 * Remember where the last persistent record is.  A new value should
+		 * not fall behind the old one.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->persistentUpTo < record)
+			XLogCtl->persistentUpTo = record;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * The records up to the returned "record" are now persistent on
+		 * NVWAL.  Now signal walsenders.
+		 */
+		WalSndWakeupRequest();
+		WalSndWakeupProcessRequests();
+
+		return;
+	}
+
 	/* Quick exit if already known flushed */
 	if (record <= LogwrtResult.Flush)
 		return;
@@ -3049,6 +3284,13 @@ XLogBackgroundFlush(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/*
+	 * Quick exit if NVWAL buffer is used and archiving is not active. In this
+	 * case, we need no WAL segment file in pg_wal directory.
+	 */
+	if (NvwalAvail && !XLogArchivingActive())
+		return false;
+
 	/* read LogwrtResult and update local state */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
@@ -3067,6 +3309,18 @@ XLogBackgroundFlush(void)
 		flexible = false;		/* ensure it all gets written */
 	}
 
+	/*
+	 * If NVWAL is used, back off to the last completed segment boundary so
+	 * that buffer pages are written to files segment by segment.  We do so
+	 * only here, after XLogCtl->asyncXactLSN is loaded, because that value
+	 * should be taken into account.
+	 */
+	if (NvwalAvail)
+	{
+		WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+		flexible = false;		/* ensure it all gets written */
+	}
+
 	/*
 	 * If already known flushed, we're done. Just need to check if we are
 	 * holding an open file handle to a logfile that's no longer in use,
@@ -3093,7 +3347,12 @@ XLogBackgroundFlush(void)
 	flushbytes =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
-	if (WalWriterFlushAfter == 0 || lastflush == 0)
+	if (NvwalAvail)
+	{
+		WriteRqst.Flush = WriteRqst.Write;
+		lastflush = now;
+	}
+	else if (WalWriterFlushAfter == 0 || lastflush == 0)
 	{
 		/* first call, or block based limits disabled */
 		WriteRqst.Flush = WriteRqst.Write;
@@ -3152,7 +3411,28 @@ XLogBackgroundFlush(void)
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
 	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+	if (NvwalAvail && max_wal_senders == 0)
+	{
+		XLogRecPtr		upto;
+
+		/*
+		 * If NVWAL is used and there is no walsender, nobody is to load
+		 * segments on the buffer.  So let's recycle segments up to {where we
+		 * have requested to write and flush} + NvwalSize.
+		 *
+		 * Note that if NVWAL is used and a walsender seems running, we have to
+		 * do nothing; keep the written pages on the buffer so that walsenders
+		 * can load them from the buffer, not from the segment files.  Note
+		 * that the buffer pages are eventually recycled by checkpoint.
+		 */
+		Assert(WriteRqst.Write == WriteRqst.Flush);
+		Assert(WriteRqst.Write % wal_segment_size == 0);
+
+		upto = WriteRqst.Write + NvwalSize;
+		AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+	}
+	else
+		AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
 
 	/*
 	 * If we determined that we need to write data, but somebody else
@@ -3885,6 +4165,43 @@ XLogFileClose(void)
 	ReleaseExternalFD();
 }
 
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+	XLogRecPtr	newupto,
+				InitializedUpTo;
+
+	Assert(NvwalAvail);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	newupto = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	InitializedUpTo = XLogCtl->InitializedUpTo;
+
+	newupto += NvwalSize;
+	Assert(newupto % wal_segment_size == 0);
+
+	if (newupto <= InitializedUpTo)
+		return;
+
+	/*
+	 * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+	 * handles the first argument as the beginning of pages, not the end.
+	 */
+	AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
 /*
  * Preallocate log files beyond the specified log endpoint.
  *
@@ -4181,8 +4498,11 @@ RemoveXlogFile(const char *segname, XLogRecPtr lastredoptr, XLogRecPtr endptr)
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
+	 *
+	 * If NVWAL buffer is used, a log segment file is never to be recycled
+	 * (that is, always go into else block).
 	 */
-	if (wal_recycle &&
+	if (!NvwalAvail && wal_recycle &&
 		endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
@@ -4600,6 +4920,7 @@ InitControlFile(uint64 sysidentifier)
 	memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
 	ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+	ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5430,41 +5751,58 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	if (NvwalAvail)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+		pgstat_report_wait_end();
 
-	/*
-	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
-	 * close the file again in a moment.
-	 */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		nv_drain();
+		pgstat_report_wait_end();
 
-	/* Write the first page with the initial record */
-	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
-	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		/*
+		 * Other WAL state will be initialized in the startup process.
+		 */
 	}
-	pgstat_report_wait_end();
+	else
+	{
+		use_existent = false;
+		openLogFile = XLogFileInit(1, &use_existent, false);
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
-	pgstat_report_wait_end();
+		/*
+		 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+		 * close the file again in a moment.
+		 */
 
-	if (close(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not close bootstrap write-ahead log file: %m")));
+		/* Write the first page with the initial record */
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
+		pgstat_report_wait_end();
 
-	openLogFile = -1;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		if (pg_fsync(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_end();
+
+		if (close(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close bootstrap write-ahead log file: %m")));
+
+		openLogFile = -1;
+	}
 
 	/* Now create pg_control */
 	InitControlFile(sysidentifier);
@@ -5718,41 +6056,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * happens in the middle of a segment, copy data from the last WAL segment
 	 * of the old timeline up to the switch point, to the starting WAL segment
 	 * on the new timeline.
+	 *
+	 * If non-volatile WAL buffer is used, no new segment file is created. Data
+	 * up to the switch point will be copied into NVWAL buffer by StartupXLOG().
 	 */
-	if (endLogSegNo == startLogSegNo)
-	{
-		/*
-		 * Make a copy of the file on the new timeline.
-		 *
-		 * Writing WAL isn't allowed yet, so there are no locking
-		 * considerations. But we should be just as tense as XLogFileInit to
-		 * avoid emplacing a bogus file.
-		 */
-		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
-					 XLogSegmentOffset(endOfLog, wal_segment_size));
-	}
-	else
+	if (!NvwalAvail)
 	{
-		/*
-		 * The switch happened at a segment boundary, so just create the next
-		 * segment on the new timeline.
-		 */
-		bool		use_existent = true;
-		int			fd;
+		if (endLogSegNo == startLogSegNo)
+		{
+			/*
+			 * Make a copy of the file on the new timeline.
+			 *
+			 * Writing WAL isn't allowed yet, so there are no locking
+			 * considerations. But we should be just as tense as XLogFileInit to
+			 * avoid emplacing a bogus file.
+			 */
+			XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+						 XLogSegmentOffset(endOfLog, wal_segment_size));
+		}
+		else
+		{
+			/*
+			 * The switch happened at a segment boundary, so just create the next
+			 * segment on the new timeline.
+			 */
+			bool		use_existent = true;
+			int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+			fd = XLogFileInit(startLogSegNo, &use_existent, true);
 
-		if (close(fd) != 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno = errno;
+			if (close(fd) != 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno = errno;
 
-			XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
-						 wal_segment_size);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not close file \"%s\": %m", xlogfname)));
+				XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+							 wal_segment_size);
+				errno = save_errno;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", xlogfname)));
+			}
 		}
 	}
 
@@ -7009,6 +7353,11 @@ StartupXLOG(void)
 		InRecovery = true;
 	}
 
+	/* Dump discardedUpTo just before REDO */
+	elog(LOG, "ControlFile->discardedUpTo %X/%X",
+		 (uint32) (ControlFile->discardedUpTo >> 32),
+		 (uint32) ControlFile->discardedUpTo);
+
 	/* REDO */
 	if (InRecovery)
 	{
@@ -7795,10 +8144,88 @@ StartupXLOG(void)
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	discardedUpTo;
+
+		discardedUpTo = ControlFile->discardedUpTo;
+		Assert(discardedUpTo == InvalidXLogRecPtr ||
+			   discardedUpTo % wal_segment_size == 0);
+
+		if (discardedUpTo == InvalidXLogRecPtr)
+		{
+			elog(DEBUG1, "brand-new NVWAL");
+
+			/* The following "Tricky point" is to initialize the buffer */
+		}
+		else if (EndOfLog <= discardedUpTo)
+		{
+			elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = InvalidXLogRecPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+
+			nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+			/* The following "Tricky point" is to initialize the buffer */
+		}
+		else
+		{
+			int			last_idx;
+			int			idx;
+			XLogRecPtr	ptr;
+
+			elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+			/*
+			 * Initialize xlblocks array because we decided to keep UNDONE
+			 * records on NVWAL buffer; otherwise each page on the buffer with
+			 * xlblocks == 0 (initialized so by XLOGShmemInit) would be
+			 * accidentally cleared by the following AdvanceXLInsertBuffer!
+			 *
+			 * Two cases can be considered:
+			 *
+			 * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+			 *    Initialize up to (and including) the page containing the last
+			 *    record.  That page should end with EndOfLog.  The one more
+			 *    next page "N" beginning with EndOfLog is to be untouched
+			 *    because, in such a very corner case that all the NVWAL
+			 *    buffer pages are already filled, page N is on the same
+			 *    location as the first page "F" beginning with discardedUpTo.
+			 *    Of course we should not overwrite the page F.
+			 *
+			 *    In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+			 *    last_idx, indicating the page N.  Then, we go forward from
+			 *    the page F up to (but excluding) page N that has the same
+			 *    index as the page F.
+			 *
+			 * 2) EndOfLog is not on a page boundary:  Initialize all the pages
+			 *    but the page "L" having the last record. The page L is to be
+			 *    initialized by the following "Tricky point", including its
+			 *    content.
+			 *
+			 * In either case, XLogCtl->InitializedUpTo is to be initialized in
+			 * the following "Tricky" if-else block.
+			 */
+
+			last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+			ptr = discardedUpTo;
+			for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+				 idx = NextBufIdx(idx))
+			{
+				ptr += XLOG_BLCKSZ;
+				XLogCtl->xlblocks[idx] = ptr;
+			}
+		}
+	}
+
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * Tricky point here: readBuf contains the *last* block that the
+	 * LastRec record spans, not the one it starts in.  The last block is
+	 * indeed the one we want to use.
 	 */
 	if (EndOfLog % XLOG_BLCKSZ != 0)
 	{
@@ -7818,6 +8245,9 @@ StartupXLOG(void)
 		memcpy(page, xlogreader->readBuf, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
+		if (NvwalAvail)
+			nv_persist(page, XLOG_BLCKSZ);
+
 		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
 		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
 	}
@@ -7831,12 +8261,54 @@ StartupXLOG(void)
 		XLogCtl->InitializedUpTo = EndOfLog;
 	}
 
-	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+	if (NvwalAvail)
+	{
+		XLogRecPtr	SegBeginPtr;
+
+		/*
+		 * If NVWAL buffer is used, writing records out to segment files should
+		 * be done segment by segment.  So Logwrt{Rqst,Result} (and also
+		 * discardedUpTo) should be multiples of wal_segment_size.  Let's back
+		 * them off to the last segment boundary.
+		 */
+
+		SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+		LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+		XLogCtl->LogwrtResult = LogwrtResult;
+		XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+		XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+		/*
+		 * persistentUpTo does not need to be a multiple of wal_segment_size;
+		 * it should be the LSN drained up to.  walsender will use it to load
+		 * records from NVWAL buffer.
+		 */
+		XLogCtl->persistentUpTo = EndOfLog;
+
+		/* Update discardedUpTo in pg_control if still invalid */
+		if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+		{
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = SegBeginPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+		}
+
+		elog(DEBUG1, "EndOfLog: %X/%X",
+			 (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
 
-	XLogCtl->LogwrtResult = LogwrtResult;
+		elog(DEBUG1, "SegBeginPtr: %X/%X",
+			 (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+	}
+	else
+	{
+		LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
-	XLogCtl->LogwrtRqst.Write = EndOfLog;
-	XLogCtl->LogwrtRqst.Flush = EndOfLog;
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		XLogCtl->LogwrtRqst.Write = EndOfLog;
+		XLogCtl->LogwrtRqst.Flush = EndOfLog;
+	}
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -7967,6 +8439,7 @@ StartupXLOG(void)
 				char		origpath[MAXPGPATH];
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
+				XLogRecPtr	discardedUpTo;
 
 				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -7978,6 +8451,53 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
+				/*
+				 * If NVWAL is also used for archival recovery, write old
+				 * records out to segment files to archive them.  Note that we
+				 * need locks related to WAL because LocalXLogInsertAllowed
+				 * already got to -1.
+				 */
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo < EndOfLog)
+				{
+					XLogwrtRqst WriteRqst;
+					TimeLineID	thisTLI = ThisTimeLineID;
+					XLogRecPtr	SegBeginPtr =
+						EndOfLog - (EndOfLog % wal_segment_size);
+
+					/*
+					 * XXX Assume that all the records have the same TLI.
+					 */
+					ThisTimeLineID = EndOfLogTLI;
+
+					WriteRqst.Write = EndOfLog;
+					WriteRqst.Flush = 0;
+
+					LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+					XLogWrite(WriteRqst, false);
+
+					/*
+					 * Force back-off to the last segment boundary.
+					 */
+					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+					ControlFile->discardedUpTo = SegBeginPtr;
+					UpdateControlFile();
+					LWLockRelease(ControlFileLock);
+
+					LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+					SpinLockAcquire(&XLogCtl->info_lck);
+					XLogCtl->LogwrtResult = LogwrtResult;
+					XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+					XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+					SpinLockRelease(&XLogCtl->info_lck);
+
+					LWLockRelease(WALWriteLock);
+
+					ThisTimeLineID = thisTLI;
+				}
+
 				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
@@ -7987,7 +8507,10 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	if (NvwalAvail)
+		PreallocNonVolatileXlogBuffer();
+	else
+		PreallocXlogFiles(EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8551,10 +9074,24 @@ GetInsertRecPtr(void)
 /*
  * GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
  * position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
  */
 XLogRecPtr
 GetFlushRecPtr(void)
 {
+	if (NvwalAvail)
+	{
+		XLogRecPtr		ret;
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		ret = XLogCtl->persistentUpTo;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		return ret;
+	}
+
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	SpinLockRelease(&XLogCtl->info_lck);
@@ -8854,6 +9391,9 @@ CreateCheckPoint(int flags)
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
+	/* for non-volatile WAL buffer */
+	XLogRecPtr	newDiscardedUpTo = 0;
+
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 	 * issued at a different time.
@@ -9165,6 +9705,22 @@ CreateCheckPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	/*
+	 * If non-volatile WAL buffer is used, discardedUpTo should be updated and
+	 * persisted in the control file. So the new value should be calculated
+	 * here.
+	 *
+	 * TODO Do not copy and paste codes...
+	 */
+	if (NvwalAvail)
+	{
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+
+		newDiscardedUpTo = _logSegNo * wal_segment_size;
+	}
+
 	/*
 	 * Update the control file.
 	 */
@@ -9173,6 +9729,16 @@ CreateCheckPoint(int flags)
 		ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
+	if (NvwalAvail)
+	{
+		/*
+		 * A new value should not fall behind the old one.
+		 */
+		if (ControlFile->discardedUpTo < newDiscardedUpTo)
+			ControlFile->discardedUpTo = newDiscardedUpTo;
+		else
+			newDiscardedUpTo = ControlFile->discardedUpTo;
+	}
 	ControlFile->time = (pg_time_t) time(NULL);
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9190,6 +9756,44 @@ CreateCheckPoint(int flags)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+	 * so that the XLOG records older than newDiscardedUpTo are treated as
+	 * "already written and flushed."
+	 */
+	if (NvwalAvail)
+	{
+		Assert(newDiscardedUpTo > 0);
+
+		/* Update process-local variables */
+		LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+		/*
+		 * Update shared-memory variables. We need both light-weight lock and
+		 * spin lock to update them.
+		 */
+		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&XLogCtl->info_lck);
+
+		/*
+		 * Note that there can be a corner case that process-local
+		 * LogwrtResult falls behind shared XLogCtl->LogwrtResult if the whole
+		 * non-volatile XLOG buffer is filled and some pages are written out
+		 * to segment files between UpdateControlFile and LWLockAcquire above.
+		 *
+		 * TODO For now, we ignore that case because it can hardly occur.
+		 */
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+		if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteLock);
+	}
+
 	/* Update shared-memory copy of checkpoint XID/epoch */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->ckptFullXid = checkPoint.nextXid;
@@ -9213,22 +9817,48 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
-	/*
-	 * Delete old log files, those no longer needed for last checkpoint to
-	 * prevent the disk holding the xlog from growing full.
-	 */
-	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-	KeepLogSeg(recptr, &_logSegNo);
-	InvalidateObsoleteReplicationSlots(_logSegNo);
-	_logSegNo--;
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	if (NvwalAvail)
+	{
+		/*
+		 * We already set _logSegNo to the value equivalent to discardedUpTo.
+		 * We first increment it to call InvalidateObsoleteReplicationSlots.
+		 */
+		_logSegNo++;
+		InvalidateObsoleteReplicationSlots(_logSegNo);
+
+		/*
+		 * Then we decrement _logSegNo again to remove WAL segment files
+		 * having spilled out of non-volatile WAL buffer.
+		 *
+		 * Note that you should set wal_recycle to off to remove segment files.
+		 */
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
+	else
+	{
+		/*
+		 * Delete old log files, those no longer needed for last checkpoint to
+		 * prevent the disk holding the xlog from growing full.
+		 */
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		InvalidateObsoleteReplicationSlots(_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+	{
+		if (NvwalAvail)
+			PreallocNonVolatileXlogBuffer();
+		else
+			PreallocXlogFiles(recptr);
+	}
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -11985,6 +12615,170 @@ CancelBackup(void)
 	}
 }
 
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+	return NvwalAvail;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets nvwalptr to load-from LSN.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+	XLogRecPtr	readUpTo;
+	XLogRecPtr	discardedUpTo;
+
+	Assert(IsNvwalAvail());
+	Assert(nvwalptr != NULL);
+
+	readUpTo = target + count;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	discardedUpTo = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	/* Check if all the records are on WAL segment files */
+	if (readUpTo <= discardedUpTo)
+		return 0;
+
+	/* Check if all the records are on NVWAL */
+	if (discardedUpTo <= target)
+	{
+		*nvwalptr = target;
+		return count;
+	}
+
+	/* Some on WAL segment files, some on NVWAL */
+	*nvwalptr = discardedUpTo;
+	return (Size) (readUpTo - discardedUpTo);
+}
+
+/*
+ * It is like WALRead @ xlogreader.c, but loads from non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	Assert(NvwalAvail);
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	/*
+	 * Hold shared WALBufMappingLock to keep others from rotating the WAL
+	 * buffer while copying WAL records from it.  We do not need exclusive
+	 * lock because we will not rotate the buffer in this function.
+	 */
+	LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+	while (nbytes > 0)
+	{
+		char	   *q;
+		Size		off;
+		Size		max_copy;
+		Size		copybytes;
+		XLogRecPtr	discardedUpTo;
+
+		LWLockAcquire(ControlFileLock, LW_SHARED);
+		discardedUpTo = ControlFile->discardedUpTo;
+		LWLockRelease(ControlFileLock);
+
+		/* Check if the records we need have been already evicted or not */
+		if (recptr < discardedUpTo)
+		{
+			LWLockRelease(WALBufMappingLock);
+
+			/* TODO error handling? */
+			return false;
+		}
+
+		/*
+		 * Get the target address on non-volatile WAL buffer and the size we
+		 * can copy from it at once, because the buffer can rotate and we
+		 * might have to copy what we want divided into two or more chunks.
+		 */
+		off = recptr % NvwalSize;
+		q = XLogCtl->pages + off;
+		max_copy = NvwalSize - off;
+		copybytes = Min(nbytes, max_copy);
+
+		memcpy(p, q, copybytes);
+
+		/* Update state for copy */
+		recptr += copybytes;
+		nbytes -= copybytes;
+		p += copybytes;
+	}
+
+	LWLockRelease(WALBufMappingLock);
+	return true;
+}
+
+static bool
+IsXLogSourceFromStream(XLogSource source)
+{
+	switch (source)
+	{
+		case XLOG_FROM_STREAM:
+		case XLOG_FROM_STREAM_NVWAL:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
+static bool
+IsXLogSourceFromNvwal(XLogSource source)
+{
+	switch (source)
+	{
+		case XLOG_FROM_NVWAL:
+		case XLOG_FROM_STREAM_NVWAL:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
+static bool
+NeedsForMoreXLog(XLogRecPtr targetChunkEndPtr)
+{
+	switch (readSource)
+	{
+		case XLOG_FROM_ARCHIVE:
+		case XLOG_FROM_PG_WAL:
+			return (readFile < 0);
+
+		case XLOG_FROM_NVWAL:
+			Assert(NvwalAvail);
+			return false;
+
+		case XLOG_FROM_STREAM:
+			return (flushedUpto < targetChunkEndPtr);
+
+		case XLOG_FROM_STREAM_NVWAL:
+			Assert(NvwalAvail);
+			return (flushedUpto < targetChunkEndPtr);
+
+		default: /* XLOG_FROM_ANY */
+			Assert(readFile < 0);
+			return true;
+	}
+}
+
 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
  * Returns number of bytes read, if the page is read successfully, or -1
@@ -12026,7 +12820,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || IsXLogSourceFromNvwal(readSource)) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -12043,7 +12837,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		if (readFile >= 0)
+			close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
 	}
@@ -12052,9 +12847,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
-		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+	if (NeedsForMoreXLog(targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -12075,7 +12868,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || IsXLogSourceFromNvwal(readSource));
 
 	/*
 	 * If the current segment is being streamed from the primary, calculate how
@@ -12083,7 +12876,7 @@ retry:
 	 * requested record has been received, but this is for the benefit of
 	 * future calls, to allow quick exit at the top of this function.
 	 */
-	if (readSource == XLOG_FROM_STREAM)
+	if (IsXLogSourceFromStream(readSource))
 	{
 		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
@@ -12094,41 +12887,59 @@ retry:
 	else
 		readLen = XLOG_BLCKSZ;
 
-	/* Read the requested page */
 	readOff = targetPageOff;
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (IsXLogSourceFromNvwal(readSource))
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		Size		offset = (Size) (targetPagePtr % NvwalSize);
+		char	   *readpos = XLogCtl->pages + offset;
+
+		Assert(offset % XLOG_BLCKSZ == 0);
 
+		/* Load the requested page from non-volatile WAL buffer */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		memcpy(readBuf, readpos, readLen);
 		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+
+		/* There is no other clue to the TLI... */
+		xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+	}
+	else
+	{
+		/* Read the requested page from file */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
+		pgstat_report_wait_end();
+
+		xlogreader->seg.ws_tli = curFileTLI;
 	}
-	pgstat_report_wait_end();
 
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->seg.ws_tli = curFileTLI;
-
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
 	 * it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -12162,6 +12973,17 @@ retry:
 		goto next_record_is_invalid;
 	}
 
+	/*
+	 * Update curFileTLI on each verified page if non-volatile WAL buffer is
+	 * used, because there is no TimeLineID information in NVWAL's filename.
+	 */
+	if (IsXLogSourceFromNvwal(readSource) &&
+		curFileTLI != xlogreader->latestPageTLI)
+	{
+		curFileTLI = xlogreader->latestPageTLI;
+		elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+	}
+
 	return readLen;
 
 next_record_is_invalid:
@@ -12243,7 +13065,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	if (!InArchiveRecovery)
 		currentSource = XLOG_FROM_PG_WAL;
 	else if (currentSource == XLOG_FROM_ANY ||
-			 (!StandbyMode && currentSource == XLOG_FROM_STREAM))
+			 (!StandbyMode && IsXLogSourceFromStream(currentSource)))
 	{
 		lastSourceFailed = false;
 		currentSource = XLOG_FROM_ARCHIVE;
@@ -12266,6 +13088,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			{
 				case XLOG_FROM_ARCHIVE:
 				case XLOG_FROM_PG_WAL:
+				case XLOG_FROM_NVWAL:
 
 					/*
 					 * Check to see if the trigger file exists. Note that we
@@ -12279,6 +13102,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						return false;
 					}
 
+					/* Try NVWAL if available */
+					if (NvwalAvail && currentSource != XLOG_FROM_NVWAL)
+					{
+						currentSource = XLOG_FROM_NVWAL;
+						break;
+					}
+
 					/*
 					 * Not in standby mode, and we've now tried the archive
 					 * and pg_wal.
@@ -12290,11 +13120,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Move to XLOG_FROM_STREAM state, and set to start a
 					 * walreceiver if necessary.
 					 */
-					currentSource = XLOG_FROM_STREAM;
+					if (currentSource == XLOG_FROM_NVWAL)
+						currentSource = XLOG_FROM_STREAM_NVWAL;
+					else
+						currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
 					break;
 
 				case XLOG_FROM_STREAM:
+				case XLOG_FROM_STREAM_NVWAL:
 
 					/*
 					 * Failure while streaming. Most likely, we got here
@@ -12400,6 +13234,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
+			case XLOG_FROM_NVWAL:
 
 				/*
 				 * WAL receiver must not be running when reading WAL from
@@ -12417,6 +13252,59 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (randAccess)
 					curFileTLI = 0;
 
+				/* Try to load from NVWAL */
+				if (currentSource == XLOG_FROM_NVWAL)
+				{
+					XLogRecPtr		discardedUpTo;
+
+					Assert(NvwalAvail);
+
+					/*
+					 * Check if the target page exists on NVWAL.  Note that
+					 * RecPtr points to the end of the target chunk.
+					 *
+					 * TODO need ControlFileLock?
+					 */
+					discardedUpTo = ControlFile->discardedUpTo;
+					if (discardedUpTo != InvalidXLogRecPtr &&
+						discardedUpTo < RecPtr &&
+						RecPtr <= discardedUpTo + NvwalSize)
+					{
+						/* Report recovery progress in PS display */
+						set_ps_display("recovering NVWAL");
+						elog(DEBUG1, "recovering NVWAL");
+
+						/* Track source of data and receipt time */
+						readSource = XLOG_FROM_NVWAL;
+						XLogReceiptSource = XLOG_FROM_NVWAL;
+						XLogReceiptTime = GetCurrentTimestamp();
+
+						/*
+						 * Construct expectedTLEs.  This is necessary when
+						 * recovering only from NVWAL, because its filename
+						 * does not carry any TLI information.
+						 */
+						if (!expectedTLEs)
+						{
+							TimeLineHistoryEntry	   *entry;
+
+							entry = palloc(sizeof(TimeLineHistoryEntry));
+							entry->tli = recoveryTargetTLI;
+							entry->begin = entry->end = InvalidXLogRecPtr;
+
+							expectedTLEs = list_make1(entry);
+							elog(DEBUG1, "expectedTLEs: [%u]",
+								 (uint32) recoveryTargetTLI);
+						}
+
+						return true;
+					}
+
+					/* Target page does not exist on NVWAL */
+					lastSourceFailed = true;
+					break;
+				}
+
 				/*
 				 * Try to restore the file from archive, or read an existing
 				 * file from pg_wal.
@@ -12434,6 +13322,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				break;
 
 			case XLOG_FROM_STREAM:
+			case XLOG_FROM_STREAM_NVWAL:
 				{
 					bool		havedata;
 
@@ -12558,21 +13447,34 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (currentSource == XLOG_FROM_STREAM_NVWAL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
-							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+
+							/* TODO is it ok to return, not to break switch? */
+							readSource = XLOG_FROM_STREAM_NVWAL;
+							XLogReceiptSource = XLOG_FROM_STREAM_NVWAL;
+							return true;
 						}
 						else
 						{
-							/* just make sure source info is correct... */
-							readSource = XLOG_FROM_STREAM;
-							XLogReceiptSource = XLOG_FROM_STREAM;
-							return true;
+							if (readFile < 0)
+							{
+								if (!expectedTLEs)
+									expectedTLEs = readTimeLineHistory(receiveTLI);
+								readFile = XLogFileRead(readSegNo, PANIC,
+														receiveTLI,
+														XLOG_FROM_STREAM, false);
+								Assert(readFile >= 0);
+							}
+							else
+							{
+								/* just make sure source info is correct... */
+								readSource = XLOG_FROM_STREAM;
+								XLogReceiptSource = XLOG_FROM_STREAM;
+								return true;
+							}
 						}
 						break;
 					}
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a63ad8cfd0..d3841cc559 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1067,11 +1067,24 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	XLogRecPtr	recptr_nvwal = 0;
+	Size		nbytes_nvwal = 0;
+#endif
 
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
 
+#ifndef FRONTEND
+	/* Try to load records directly from NVWAL if used */
+	if (IsNvwalAvail())
+	{
+		nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+		nbytes = count - nbytes_nvwal;
+	}
+#endif
+
 	while (nbytes > 0)
 	{
 		uint32		startoff;
@@ -1139,6 +1152,17 @@ WALRead(XLogReaderState *state,
 		p += readbytes;
 	}
 
+#ifndef FRONTEND
+	if (IsNvwalAvail())
+	{
+		if (!CopyXLogRecordsFromNVWAL(p, nbytes_nvwal, recptr_nvwal))
+		{
+			/* TODO graceful error handling */
+			elog(PANIC, "some records on NVWAL had been discarded");
+		}
+	}
+#endif
+
 	return true;
 }
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f70..eabcaae2ff 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("NVWAL discarded up to:                %X/%X\n"),
+		   (uint32) (ControlFile->discardedUpTo >> 32),
+		   (uint32) ControlFile->discardedUpTo);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 03fd1267e8..ddf786290b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -354,6 +354,14 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern bool IsNvwalAvail(void);
+extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
+										   Size count,
+										   XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+									 Size count,
+									 XLogRecPtr startptr);
+
 /*
  * Routines to start, stop, and get status of a base backup.
  */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index 06bed90c5e..012eeee058 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1301
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
 
 	XLogRecPtr	unloggedLSN;	/* current fake LSN value, for unlogged rels */
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+	 * checkpoint or a restartpoint is completed successfully, or the whole
+	 * NVWAL is filled with WAL records and a new record is being inserted.
+	 * This field tells that the NVWAL contains WAL records in the range of
+	 * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+	 * Note that the WAL records whose LSNs are less than discardedUpTo may
+	 * remain in WAL segment files and still be needed for recovery.
+	 *
+	 * It is set to zero when NVWAL is not used.
+	 */
+	XLogRecPtr	discardedUpTo;
+
 	/*
 	 * These two values determine the minimum point we must recover up to
 	 * before starting up:
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..bbd47e1663 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -142,14 +142,17 @@ HINT:  No function matches the given name and argument types. You might need to
 select setting as segsize
 from pg_settings where name = 'wal_segment_size'
 \gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
  ok 
 ----
  t
 (1 row)
 
 -- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
  ok 
 ----
  t
@@ -161,14 +164,15 @@ select * from pg_ls_waldir() limit 0;
 ------+------+--------------
 (0 rows)
 
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
  ok 
 ----
  t
 (1 row)
 
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+  (select * from pg_ls_waldir() w
+   where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
  ok 
 ----
  t
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 094e8f8296..09c326775d 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -39,15 +39,19 @@ SELECT num_nulls();
 select setting as segsize
 from pg_settings where name = 'wal_segment_size'
 \gset
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
 
-select count(*) > 0 as ok from pg_ls_waldir();
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
 -- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
 -- Test not-run-to-completion cases.
 select * from pg_ls_waldir() limit 0;
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+  (select * from pg_ls_waldir() w
+   where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
 
 select count(*) >= 0 as ok from pg_ls_archive_statusdir();
 
-- 
2.17.1
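
Reader's note on the recovery hunk above: the check added to WaitForWALToBecomeAvailable() can be read as a window test on the end pointer of the requested chunk. It is consistent with the half-open range [discardedUpTo, discardedUpTo+S) documented in pg_control.h. The following is only a minimal sketch under stated assumptions: the typedef and macro are stand-ins for PostgreSQL's headers, and the helper name nvwal_holds_chunk is hypothetical and does not appear in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for PostgreSQL definitions (assumption: 64-bit LSN, 0 = invalid). */
typedef uint64_t XLogRecPtr;
#define InvalidXLogRecPtr ((XLogRecPtr) 0)

/*
 * Hypothetical helper: returns true if a chunk ending at end_ptr lies
 * within the window of WAL currently held on NVWAL.  end_ptr is the END of
 * the chunk, so the comparison is (discarded_up_to, discarded_up_to + size].
 */
static bool
nvwal_holds_chunk(XLogRecPtr discarded_up_to, uint64_t nvwal_size,
                  XLogRecPtr end_ptr)
{
    assert(nvwal_size > 0);

    /* A zero discardedUpTo means NVWAL is not in use. */
    if (discarded_up_to == InvalidXLogRecPtr)
        return false;

    return discarded_up_to < end_ptr &&
           end_ptr <= discarded_up_to + nvwal_size;
}
```

With discarded_up_to = 0x1000 and nvwal_size = 0x4000, an end pointer of 0x5000 is still inside the window, while 0x1000 and 0x5001 are not.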

Attachment: v4-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (application/octet-stream)
From 72506483b9b02a7f89273a5090ec6ab061457831 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:58 +0900
Subject: [PATCH v4 3/5] walreceiver supports non-volatile WAL buffer

Walreceiver now stores received records directly in the non-volatile
WAL buffer if applicable.
---
 src/backend/access/transam/xlog.c     | 31 +++++++++++++++-
 src/backend/replication/walreceiver.c | 53 ++++++++++++++++++++++++++-
 src/include/access/xlog.h             |  4 ++
 3 files changed, 85 insertions(+), 3 deletions(-)
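
Reader's note: the copy loop shared by CopyXLogRecordsFromNVWAL() and CopyXLogRecordsToNVWAL() treats the NVWAL as a ring buffer, so a copy that reaches the end of the mapped area continues from offset zero. The sketch below illustrates only that wraparound idea. It assumes the offset is the LSN modulo the buffer size (the patch's own offset computation is not shown in this hunk), uses a plain char array in place of PMEM, and omits the nv_flush() call that the patch issues after each store. The name ring_copy is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t XLogRecPtr;    /* stand-in for PostgreSQL's type */

/*
 * Hypothetical helper: copy count bytes between buf and a ring buffer,
 * starting at logical position startptr.  store = true copies buf into the
 * ring; store = false copies the ring into buf.
 */
static void
ring_copy(char *ring, size_t ring_size, char *buf, size_t count,
          XLogRecPtr startptr, bool store)
{
    char       *p = buf;
    XLogRecPtr  recptr = startptr;
    size_t      nbytes = count;

    assert(ring_size > 0 && count <= ring_size);

    while (nbytes > 0)
    {
        /* Assumed mapping from LSN to ring offset. */
        size_t  off = (size_t) (recptr % ring_size);
        /* Never copy past the end of the ring in one step. */
        size_t  max_copy = ring_size - off;
        size_t  copybytes = nbytes < max_copy ? nbytes : max_copy;
        char   *q = ring + off;

        if (store)
            memcpy(q, p, copybytes);    /* the patch follows this with nv_flush() */
        else
            memcpy(p, q, copybytes);

        recptr += copybytes;
        nbytes -= copybytes;
        p += copybytes;
    }
}
```

Storing five bytes at position 6 of an 8-byte ring writes two bytes at offsets 6 and 7 and the remaining three at offsets 0 to 2. Loading from the same position returns them in the original order.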

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6a579a308f..dfa7c2517b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -925,6 +925,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+static bool CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr,
+								   bool store);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -12664,6 +12666,21 @@ GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
  */
 bool
 CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	return CopyXLogRecordsOnNVWAL(buf, count, startptr, false);
+}
+
+/*
+ * Called by walreceiver.
+ */
+bool
+CopyXLogRecordsToNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	return CopyXLogRecordsOnNVWAL(buf, count, startptr, true);
+}
+
+static bool
+CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr, bool store)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -12713,7 +12730,13 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
 		max_copy = NvwalSize - off;
 		copybytes = Min(nbytes, max_copy);
 
-		memcpy(p, q, copybytes);
+		if (store)
+		{
+			memcpy(q, p, copybytes);
+			nv_flush(q, copybytes);
+		}
+		else
+			memcpy(p, q, copybytes);
 
 		/* Update state for copy */
 		recptr += copybytes;
@@ -12725,6 +12748,12 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
 	return true;
 }
 
+void
+SyncNVWAL(void)
+{
+	nv_drain();
+}
+
 static bool
 IsXLogSourceFromStream(XLogSource source)
 {
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 7c11e1ab44..563dd59ec0 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -130,6 +130,7 @@ static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *start
 static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+static void XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
@@ -856,7 +857,10 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				buf += hdrlen;
 				len -= hdrlen;
-				XLogWalRcvWrite(buf, len, dataStart);
+				if (IsNvwalAvail())
+					XLogWalRcvStore(buf, len, dataStart);
+				else
+					XLogWalRcvWrite(buf, len, dataStart);
 				break;
 			}
 		case 'k':				/* Keepalive */
@@ -991,6 +995,42 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
+/*
+ * Like XLogWalRcvWrite, but store to non-volatile WAL buffer.
+ */
+static void
+XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr)
+{
+	Assert(IsNvwalAvail());
+
+	CopyXLogRecordsToNVWAL(buf, nbytes, recptr);
+
+	/*
+	 * Also write out to file if we have to archive segments.
+	 *
+	 * We could do this segment by segment, but we reuse the existing method
+	 * and do it record by record because the former adds complexity
+	 * (locking WalBufMappingLock, getting the address of the segment on
+	 * the non-volatile WAL buffer, etc.).
+	 */
+	if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		XLogWalRcvWrite(buf, nbytes, recptr);
+	else
+	{
+		/*
+		 * Update status as XLogWalRcvWrite does.
+		 */
+
+		/* Update process-local status */
+		XLByteToSeg(recptr + nbytes, recvSegNo, wal_segment_size);
+		recvFileTLI = ThisTimeLineID;
+		LogstreamResult.Write = recptr + nbytes;
+
+		/* Update shared-memory status */
+		pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	}
+}
+
 /*
  * Flush the log to disk.
  *
@@ -1004,7 +1044,16 @@ XLogWalRcvFlush(bool dying)
 	{
 		WalRcvData *walrcv = WalRcv;
 
-		issue_xlog_fsync(recvFile, recvSegNo);
+		/*
+		 * We should call both SyncNVWAL and issue_xlog_fsync if we use NVWAL
+		 * and WAL archive.  So we have the following two if-statements, not
+		 * one if-else-statement.
+		 */
+		if (IsNvwalAvail())
+			SyncNVWAL();
+
+		if (recvFile >= 0)
+			issue_xlog_fsync(recvFile, recvSegNo);
 
 		LogstreamResult.Flush = LogstreamResult.Write;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index ddf786290b..799357cfac 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -361,6 +361,10 @@ extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
 extern bool CopyXLogRecordsFromNVWAL(char *buf,
 									 Size count,
 									 XLogRecPtr startptr);
+extern bool CopyXLogRecordsToNVWAL(char *buf,
+								   Size count,
+								   XLogRecPtr startptr);
+extern void SyncNVWAL(void);
 
 /*
  * Routines to start, stop, and get status of a base backup.
-- 
2.17.1

Attachment: v4-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (application/octet-stream)
From 5b794eab4f57a17c41b79769ee6def3cc050bdd0 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:59 +0900
Subject: [PATCH v4 4/5] pg_basebackup supports non-volatile WAL buffer

pg_basebackup now copies received WAL segments onto the non-volatile
WAL buffer if you run it in "nvwal" mode (-Fn).

You must specify a new NVWAL path with the --nvwal-path option.
The path will be written to postgresql.auto.conf or recovery.conf.
The size of the new NVWAL is the same as the master's.
---
 src/bin/pg_basebackup/pg_basebackup.c | 335 +++++++++++++++++++++++++-
 src/bin/pg_basebackup/streamutil.c    |  69 ++++++
 src/bin/pg_basebackup/streamutil.h    |   3 +
 src/bin/pg_rewind/pg_rewind.c         |   4 +-
 src/fe_utils/recovery_gen.c           |   9 +-
 src/include/fe_utils/recovery_gen.h   |   3 +-
 6 files changed, 407 insertions(+), 16 deletions(-)
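
Reader's note: to create an NVWAL of the same size as the master's, RetrieveNvwalSize() parses the text returned by "SHOW nvwal_size" into a value and a unit. The sketch below shows one way such a conversion might look. It is not the patch's code: the helper name parse_size_with_unit is hypothetical, it bounds the unit with "%2s" so that a longer string cannot overrun the three-byte buffer, and it assumes a 64-bit size_t for the TB case.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: convert text such as "80GB" into bytes.
 * Returns 0 if the text cannot be parsed or the unit is unsupported.
 */
static size_t
parse_size_with_unit(const char *text)
{
    char    unit[3];
    int     val;

    assert(text != NULL);

    /* %2s reads at most two characters, leaving room for the terminator. */
    if (sscanf(text, "%d%2s", &val, unit) != 2 || val <= 0)
        return 0;

    if (strcmp(unit, "MB") == 0)
        return ((size_t) val) << 20;
    if (strcmp(unit, "GB") == 0)
        return ((size_t) val) << 30;
    if (strcmp(unit, "TB") == 0)
        return ((size_t) val) << 40;   /* assumes 64-bit size_t */

    return 0;
}
```

As in the patch, the caller would still need to check that the result is a multiple of the WAL segment size before mapping the file.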

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 7a5d4562f9..9b85949078 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,9 @@
 #ifdef HAVE_LIBZ
 #include <zlib.h>
 #endif
+#ifdef USE_NVWAL
+#include <libpmem.h>
+#endif
 
 #include "access/xlog_internal.h"
 #include "common/file_perm.h"
@@ -127,7 +130,8 @@ typedef enum
 static char *basedir = NULL;
 static TablespaceList tablespace_dirs = {NULL, NULL};
 static char *xlog_dir = NULL;
-static char format = 'p';		/* p(lain)/t(ar) */
+static char format = 'p';			/* p(lain)/t(ar); 'p' even if 'nvwal' given */
+static bool format_nvwal = false;	/* true if 'nvwal' given */
 static char *label = "pg_basebackup base backup";
 static bool noclean = false;
 static bool checksum_failure = false;
@@ -150,14 +154,24 @@ static bool verify_checksums = true;
 static bool manifest = true;
 static bool manifest_force_encode = false;
 static char *manifest_checksums = NULL;
+static char *nvwal_path = NULL;
+#ifdef USE_NVWAL
+static size_t nvwal_size = 0;
+static char *nvwal_pages = NULL;
+static size_t nvwal_mapped_len = 0;
+#endif
 
 static bool success = false;
+static bool xlogdir_is_pg_xlog = false;
 static bool made_new_pgdata = false;
 static bool found_existing_pgdata = false;
 static bool made_new_xlogdir = false;
 static bool found_existing_xlogdir = false;
 static bool made_tablespace_dirs = false;
 static bool found_tablespace_dirs = false;
+#ifdef USE_NVWAL
+static bool made_new_nvwal = false;
+#endif
 
 /* Progress counters */
 static uint64 totalsize_kb;
@@ -382,7 +396,7 @@ usage(void)
 	printf(_("  %s [OPTION]...\n"), progname);
 	printf(_("\nOptions controlling the output:\n"));
 	printf(_("  -D, --pgdata=DIRECTORY receive base backup into directory\n"));
-	printf(_("  -F, --format=p|t       output format (plain (default), tar)\n"));
+	printf(_("  -F, --format=p|t|n     output format (plain (default), tar, nvwal)\n"));
 	printf(_("  -r, --max-rate=RATE    maximum transfer rate to transfer data directory\n"
 			 "                         (in kB/s, or use suffix \"k\" or \"M\")\n"));
 	printf(_("  -R, --write-recovery-conf\n"
@@ -390,6 +404,7 @@ usage(void)
 	printf(_("  -T, --tablespace-mapping=OLDDIR=NEWDIR\n"
 			 "                         relocate tablespace in OLDDIR to NEWDIR\n"));
 	printf(_("      --waldir=WALDIR    location for the write-ahead log directory\n"));
+	printf(_("      --nvwal-path=NVWAL location for the NVWAL file\n"));
 	printf(_("  -X, --wal-method=none|fetch|stream\n"
 			 "                         include required WAL files with specified method\n"));
 	printf(_("  -z, --gzip             compress tar output\n"));
@@ -630,9 +645,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 
 	/* In post-10 cluster, pg_xlog has been renamed to pg_wal */
 	snprintf(param->xlog, sizeof(param->xlog), "%s/%s",
-			 basedir,
-			 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-			 "pg_xlog" : "pg_wal");
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 	/* Temporary replication slots are only supported in 10 and newer */
 	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_TEMP_SLOTS)
@@ -669,9 +682,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 		 * tar file may arrive later.
 		 */
 		snprintf(statusdir, sizeof(statusdir), "%s/%s/archive_status",
-				 basedir,
-				 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-				 "pg_xlog" : "pg_wal");
+				 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 		if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
 		{
@@ -1793,6 +1804,135 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
 	appendPQExpBuffer(buf, copybuf, r);
 }
 
+#ifdef USE_NVWAL
+static void
+cleanup_nvwal_atexit(void)
+{
+	if (success || in_log_streamer)
+		return;
+
+	if (nvwal_pages != NULL)
+	{
+		pg_log_info("unmapping NVWAL");
+		if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+		{
+			pg_log_error("could not unmap NVWAL: %m");
+			return;
+		}
+	}
+
+	if (nvwal_path != NULL && made_new_nvwal)
+	{
+		pg_log_info("removing NVWAL file \"%s\"", nvwal_path);
+		if (unlink(nvwal_path) < 0)
+		{
+			pg_log_error("could not remove NVWAL file \"%s\": %m", nvwal_path);
+			return;
+		}
+	}
+}
+
+static int
+filter_walseg(const struct dirent *d)
+{
+	char			fullpath[MAXPGPATH];
+	struct stat		statbuf;
+
+	if (!IsXLogFileName(d->d_name))
+		return 0;
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", d->d_name);
+
+	if (stat(fullpath, &statbuf) < 0)
+		return 0;
+
+	if (!S_ISREG(statbuf.st_mode))
+		return 0;
+
+	if (statbuf.st_size != WalSegSz)
+		return 0;
+
+	return 1;
+}
+
+static int
+compare_walseg(const struct dirent **a, const struct dirent **b)
+{
+	return strcmp((*a)->d_name, (*b)->d_name);
+}
+
+static void
+free_namelist(struct dirent **namelist, int nr)
+{
+	for (int i = 0; i < nr; ++i)
+		free(namelist[i]);
+
+	free(namelist);
+}
+
+static bool
+copy_walseg_onto_nvwal(const char *segname)
+{
+	char			fullpath[MAXPGPATH];
+	int				fd;
+	size_t			off;
+	struct stat		statbuf;
+	TimeLineID		tli;
+	XLogSegNo		segno;
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", segname);
+
+	fd = open(fullpath, O_RDONLY);
+	if (fd < 0)
+	{
+		pg_log_error("could not open xlog segment \"%s\": %m", fullpath);
+		return false;
+	}
+
+	if (fstat(fd, &statbuf) < 0)
+	{
+		pg_log_error("could not fstat xlog segment \"%s\": %m", fullpath);
+		goto close_on_error;
+	}
+
+	if (!S_ISREG(statbuf.st_mode))
+	{
+		pg_log_error("xlog segment \"%s\" is not a regular file", fullpath);
+		goto close_on_error;
+	}
+
+	if (statbuf.st_size != WalSegSz)
+	{
+		pg_log_error("invalid size of xlog segment \"%s\"; expected %d, actual %zd",
+					 fullpath, WalSegSz, (ssize_t) statbuf.st_size);
+		goto close_on_error;
+	}
+
+	XLogFromFileName(segname, &tli, &segno, WalSegSz);
+	off = ((size_t) segno * WalSegSz) % nvwal_size;
+
+	if (read(fd, &nvwal_pages[off], WalSegSz) < WalSegSz)
+	{
+		pg_log_error("could not fully read xlog segment \"%s\": %m", fullpath);
+		goto close_on_error;
+	}
+
+	if (close(fd) < 0)
+	{
+		pg_log_error("could not close xlog segment \"%s\": %m", fullpath);
+		return false;
+	}
+
+	return true;
+
+close_on_error:
+	(void) close(fd);
+	return false;
+}
+#endif
+
 static void
 BaseBackup(void)
 {
@@ -1851,7 +1991,8 @@ BaseBackup(void)
 	 * Build contents of configuration file if requested
 	 */
 	if (writerecoveryconf)
-		recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot);
+		recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot,
+													  nvwal_path);
 
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
@@ -2216,6 +2357,69 @@ BaseBackup(void)
 			exit(1);
 	}
 
+#ifdef USE_NVWAL
+	/* Copy xlog segments into NVWAL when nvwal mode */
+	if (format_nvwal)
+	{
+		char	xldr_path[MAXPGPATH];
+		int		nr_segs;
+		struct dirent **namelist;
+
+		/* clear NVWAL before copying xlog segments */
+		pmem_memset_persist(nvwal_pages, 0, nvwal_size);
+
+		snprintf(xldr_path, sizeof(xldr_path), "%s/%s",
+				 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
+
+		/*
+		 * Sort xlog segments in ascending order, filtering out non-segment
+		 * files and directories.
+		 */
+		nr_segs = scandir(xldr_path, &namelist, filter_walseg, compare_walseg);
+		if (nr_segs < 0)
+		{
+			pg_log_error("could not scan xlog directory \"%s\": %m", xldr_path);
+			exit(1);
+		}
+
+		/* Copy xlog segments onto NVWAL */
+		for (int i = 0; i < nr_segs; ++i)
+		{
+			if (!copy_walseg_onto_nvwal(namelist[i]->d_name))
+			{
+				free_namelist(namelist, nr_segs);
+				exit(1);
+			}
+		}
+
+		/* Copy complete; now remove all the xlog segments */
+		for (int i = 0; i < nr_segs; ++i)
+		{
+			char		fullpath[MAXPGPATH];
+
+			snprintf(fullpath, sizeof(fullpath), "%s/%s",
+					 xldr_path, namelist[i]->d_name);
+
+			if (unlink(fullpath) < 0)
+			{
+				pg_log_error("could not remove xlog segment \"%s\": %m", fullpath);
+				free_namelist(namelist, nr_segs);
+				exit(1);
+			}
+		}
+
+		free_namelist(namelist, nr_segs);
+
+		if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+		{
+			pg_log_error("could not unmap NVWAL: %m");
+			exit(1);
+		}
+		nvwal_pages = NULL;
+		nvwal_mapped_len = 0;
+	}
+#endif
+
 	if (verbose)
 		pg_log_info("base backup completed");
 }
@@ -2257,6 +2461,7 @@ main(int argc, char **argv)
 		{"no-manifest", no_argument, NULL, 5},
 		{"manifest-force-encode", no_argument, NULL, 6},
 		{"manifest-checksums", required_argument, NULL, 7},
+		{"nvwal-path", required_argument, NULL, 8},
 		{NULL, 0, NULL, 0}
 	};
 	int			c;
@@ -2297,9 +2502,27 @@ main(int argc, char **argv)
 				break;
 			case 'F':
 				if (strcmp(optarg, "p") == 0 || strcmp(optarg, "plain") == 0)
+				{
+					/* See the comment for "nvwal" below */
 					format = 'p';
+					format_nvwal = false;
+				}
 				else if (strcmp(optarg, "t") == 0 || strcmp(optarg, "tar") == 0)
+				{
+					/* See the comment for "nvwal" below */
 					format = 't';
+					format_nvwal = false;
+				}
+				else if (strcmp(optarg, "n") == 0 || strcmp(optarg, "nvwal") == 0)
+				{
+					/*
+					 * If "nvwal" mode is given, we set the two variables as
+					 * follows because it is almost the same as "plain" mode,
+					 * except for NVWAL handling.
+					 */
+					format = 'p';
+					format_nvwal = true;
+				}
 				else
 				{
 					pg_log_error("invalid output format \"%s\", must be \"plain\" or \"tar\"",
@@ -2354,6 +2577,9 @@ main(int argc, char **argv)
 			case 1:
 				xlog_dir = pg_strdup(optarg);
 				break;
+			case 8:
+				nvwal_path = pg_strdup(optarg);
+				break;
 			case 'l':
 				label = pg_strdup(optarg);
 				break;
@@ -2535,7 +2761,7 @@ main(int argc, char **argv)
 	{
 		if (format != 'p')
 		{
-			pg_log_error("WAL directory location can only be specified in plain mode");
+			pg_log_error("WAL directory location can only be specified in plain or nvwal mode");
 			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 					progname);
 			exit(1);
@@ -2552,6 +2778,44 @@ main(int argc, char **argv)
 		}
 	}
 
+#ifdef USE_NVWAL
+	if (format_nvwal)
+	{
+		if (nvwal_path == NULL)
+		{
+			pg_log_error("NVWAL file location must be given in nvwal mode");
+			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+					progname);
+			exit(1);
+		}
+
+		/* clean up NVWAL file name and check if it is absolute */
+		canonicalize_path(nvwal_path);
+		if (!is_absolute_path(nvwal_path))
+		{
+			pg_log_error("NVWAL file location must be an absolute path");
+			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+					progname);
+			exit(1);
+		}
+
+		/* We do not map NVWAL file here because we do not know its size yet */
+	}
+	else if (nvwal_path != NULL)
+	{
+		pg_log_error("NVWAL file location can only be specified in nvwal mode");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+#else
+	if (format_nvwal || nvwal_path != NULL)
+	{
+		pg_log_error("this build does not support nvwal mode");
+		exit(1);
+	}
+#endif /* USE_NVWAL */
+
 #ifndef HAVE_LIBZ
 	if (compresslevel != 0)
 	{
@@ -2596,6 +2860,9 @@ main(int argc, char **argv)
 	}
 	atexit(disconnect_atexit);
 
+	/* Remember the predicate for use after disconnection */
+	xlogdir_is_pg_xlog = (PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL);
+
 	/*
 	 * Set umask so that directories/files are created with the same
 	 * permissions as directories/files in the source data directory.
@@ -2622,6 +2889,16 @@ main(int argc, char **argv)
 	if (!RetrieveWalSegSize(conn))
 		exit(1);
 
+#ifdef USE_NVWAL
+	/* determine remote server's NVWAL size */
+	if (format_nvwal)
+	{
+		nvwal_size = RetrieveNvwalSize(conn);
+		if (nvwal_size == 0)
+			exit(1);
+	}
+#endif
+
 	/* Create pg_wal symlink, if required */
 	if (xlog_dir)
 	{
@@ -2634,8 +2911,7 @@ main(int argc, char **argv)
 		 * renamed to pg_wal in post-10 clusters.
 		 */
 		linkloc = psprintf("%s/%s", basedir,
-						   PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-						   "pg_xlog" : "pg_wal");
+						   xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 #ifdef HAVE_SYMLINK
 		if (symlink(xlog_dir, linkloc) != 0)
@@ -2650,6 +2926,41 @@ main(int argc, char **argv)
 		free(linkloc);
 	}
 
+#ifdef USE_NVWAL
+	/* Create and map NVWAL file if required */
+	if (format_nvwal)
+	{
+		int		is_pmem = 0;
+
+		nvwal_pages = pmem_map_file(nvwal_path, nvwal_size,
+									PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+									pg_file_create_mode,
+									&nvwal_mapped_len, &is_pmem);
+		if (nvwal_pages == NULL)
+		{
+			pg_log_error("could not map a new NVWAL file \"%s\": %m",
+						 nvwal_path);
+			exit(1);
+		}
+
+		made_new_nvwal = true;
+		atexit(cleanup_nvwal_atexit);
+
+		if (!is_pmem)
+		{
+			pg_log_error("NVWAL file \"%s\" is not on PMEM", nvwal_path);
+			exit(1);
+		}
+
+		if (nvwal_size != nvwal_mapped_len)
+		{
+			pg_log_error("invalid size of NVWAL file \"%s\"; expected %zu, actual %zu",
+						 nvwal_path, nvwal_size, nvwal_mapped_len);
+			exit(1);
+		}
+	}
+#endif
+
 	BaseBackup();
 
 	success = true;
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index be653ebb2d..baf3a7bc53 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -398,6 +398,75 @@ RetrieveDataDirCreatePerm(PGconn *conn)
 	return true;
 }
 
+#ifdef USE_NVWAL
+/*
+ * Returns nvwal_size in bytes if available, 0 otherwise.
+ * Note that it is the caller's responsibility to check that the returned
+ * nvwal_size is really valid, that is, a multiple of the WAL segment size.
+ */
+size_t
+RetrieveNvwalSize(PGconn *conn)
+{
+	PGresult   *res;
+	char		unit[3];
+	int			val;
+	size_t		nvwal_size;
+
+	/* check connection existence */
+	Assert(conn != NULL);
+
+	/* fail if we do not have SHOW command */
+	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_SHOW_CMD)
+	{
+		pg_log_error("SHOW command is not supported for retrieving nvwal_size");
+		return 0;
+	}
+
+	res = PQexec(conn, "SHOW nvwal_size");
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("could not send replication command \"%s\": %s",
+					 "SHOW nvwal_size", PQerrorMessage(conn));
+
+		PQclear(res);
+		return 0;
+	}
+	if (PQntuples(res) != 1 || PQnfields(res) < 1)
+	{
+		pg_log_error("could not fetch NVWAL size: got %d rows and %d fields, expected %d rows and %d or more fields",
+					 PQntuples(res), PQnfields(res), 1, 1);
+
+		PQclear(res);
+		return 0;
+	}
+
+	/* fetch value and unit from the result */
+	if (sscanf(PQgetvalue(res, 0, 0), "%d%2s", &val, unit) != 2)
+	{
+		pg_log_error("NVWAL size could not be parsed");
+		PQclear(res);
+		return 0;
+	}
+
+	PQclear(res);
+
+	/* convert to bytes */
+	if (strcmp(unit, "MB") == 0)
+		nvwal_size = ((size_t) val) << 20;
+	else if (strcmp(unit, "GB") == 0)
+		nvwal_size = ((size_t) val) << 30;
+	else if (strcmp(unit, "TB") == 0)
+		nvwal_size = ((size_t) val) << 40;
+	else
+	{
+		pg_log_error("unsupported NVWAL unit");
+		return 0;
+	}
+
+	return nvwal_size;
+}
+#endif
+
 /*
  * Run IDENTIFY_SYSTEM through a given connection and give back to caller
  * some result information if requested:
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 57448656e3..b4c2ab1a74 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -41,6 +41,9 @@ extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  XLogRecPtr *startpos,
 							  char **db_name);
 extern bool RetrieveWalSegSize(PGconn *conn);
+#ifdef USE_NVWAL
+extern size_t RetrieveNvwalSize(PGconn *conn);
+#endif
 extern TimestampTz feGetCurrentTimestamp(void);
 extern void feTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 								  long *secs, int *microsecs);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 23fc749e44..858a399f52 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -360,7 +360,7 @@ main(int argc, char **argv)
 		pg_log_info("no rewind required");
 		if (writerecoveryconf && !dry_run)
 			WriteRecoveryConfig(conn, datadir_target,
-								GenerateRecoveryConfig(conn, NULL));
+								GenerateRecoveryConfig(conn, NULL, NULL));
 		exit(0);
 	}
 
@@ -459,7 +459,7 @@ main(int argc, char **argv)
 
 	if (writerecoveryconf && !dry_run)
 		WriteRecoveryConfig(conn, datadir_target,
-							GenerateRecoveryConfig(conn, NULL));
+							GenerateRecoveryConfig(conn, NULL, NULL));
 
 	pg_log_info("Done!");
 
diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c
index 46ca20e20b..1e08ec3fa8 100644
--- a/src/fe_utils/recovery_gen.c
+++ b/src/fe_utils/recovery_gen.c
@@ -20,7 +20,7 @@ static char *escape_quotes(const char *src);
  * return it.
  */
 PQExpBuffer
-GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
+GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot, char *nvwal_path)
 {
 	PQconninfoOption *connOptions;
 	PQExpBufferData conninfo_buf;
@@ -95,6 +95,13 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
 						  replication_slot);
 	}
 
+	if (nvwal_path)
+	{
+		escaped = escape_quotes(nvwal_path);
+		appendPQExpBuffer(contents, "nvwal_path = '%s'\n", escaped);
+		free(escaped);
+	}
+
 	if (PQExpBufferBroken(contents))
 	{
 		pg_log_error("out of memory");
diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h
index c8655cd294..061c59125b 100644
--- a/src/include/fe_utils/recovery_gen.h
+++ b/src/include/fe_utils/recovery_gen.h
@@ -21,7 +21,8 @@
 #define MINIMUM_VERSION_FOR_RECOVERY_GUC 120000
 
 extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn,
-										  char *pg_replication_slot);
+										  char *pg_replication_slot,
+										  char *nvwal_path);
 extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
 								PQExpBuffer contents);
 
-- 
2.17.1
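
As a side note on the unit handling in RetrieveNvwalSize() above: the conversion can also be sketched standalone, without sscanf(). This is only an illustration and not part of the patchset. parse_nvwal_size() and is_valid_nvwal_size() are hypothetical names, strtoull() is swapped in for the patch's sscanf() call, and the accepted units (MB, GB, TB) are simply the ones the patch handles. The second helper shows the multiple-of-segment-size check that the function's comment leaves to the caller.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): convert a SHOW-style value
 * such as "80GB" into bytes. Returns 0 if the text does not start with a
 * digit, the unit is not MB/GB/TB, or the byte count would overflow.
 */
uint64_t
parse_nvwal_size(const char *text)
{
	char	   *end;
	unsigned long long val;
	unsigned	shift;

	assert(text != NULL);

	/* strtoull() would accept a sign or leading blanks; require a digit. */
	if (!isdigit((unsigned char) text[0]))
		return 0;

	errno = 0;
	val = strtoull(text, &end, 10);
	if (errno == ERANGE)
		return 0;

	/* The remainder of the string must be exactly one known unit. */
	if (strcmp(end, "MB") == 0)
		shift = 20;
	else if (strcmp(end, "GB") == 0)
		shift = 30;
	else if (strcmp(end, "TB") == 0)
		shift = 40;
	else
		return 0;

	/* Reject values whose byte count does not fit in 64 bits. */
	if (val > (UINT64_MAX >> shift))
		return 0;

	return (uint64_t) val << shift;
}

/* True if size is a positive multiple of the WAL segment size. */
int
is_valid_nvwal_size(uint64_t size, uint64_t wal_segment_size)
{
	return size > 0 && wal_segment_size > 0 && size % wal_segment_size == 0;
}
```

For example, parse_nvwal_size("80GB") yields 85899345920 bytes, which is a multiple of a 16MB segment size, while "80GiB" or a bare "80" is rejected.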

v4-0005-README-for-non-volatile-WAL-buffer.patchapplication/octet-stream; name=v4-0005-README-for-non-volatile-WAL-buffer.patchDownload
From a5ef218e1eab55dedcc6061f88eb3fae3b057fdf Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:08:00 +0900
Subject: [PATCH v4 5/5] README for non-volatile WAL buffer

---
 README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 README.nvwal

diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+Here is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By putting the WAL buffer pages on persistent memory (PMEM) [1],
+inserting WAL records into them directly, and eliminating I/O for WAL segment
+files, PostgreSQL gets lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+  * (Recommended) Supporting CLFLUSHOPT or CLWB instruction
+    * See if lscpu shows "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+  * Linux: 4.15 or later (tested on 5.2)
+  * Windows: (Sorry but we have not tested on Windows yet.)
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+  * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+We have a new configure option --with-nvwal.
+
+It is a good idea to install under your home directory with the --prefix option.
+If you do so, please DO NOT forget "export PATH".
+
+  $ ./configure --with-nvwal --prefix="$HOME/postgres"
+  $ make
+  $ make install
+  $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make an ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it onto /mnt/pmem0. Please DO NOT forget the "-o dax"
+option on mount. For Intel DCPMM and ipmctl, please see [4].
+
+  $ ndctl list
+  [
+    {
+      "dev":"namespace1.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem1",
+      "numa_node":1
+    },
+    {
+      "dev":"namespace0.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem0",
+      "numa_node":0
+    }
+  ]
+
+  $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+  {
+    "dev":"namespace0.0",
+    "mode":"fsdax",
+    "map":"dev",
+    "size":"94.50 GiB (101.47 GB)",
+    "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+    "sector_size":512,
+    "blockdev":"pmem0",
+    "numa_node":0
+  }
+
+  $ ls -l /dev/pmem0
+  brw-rw---- 1 root disk 259, 3 Jan  6 17:06 /dev/pmem0
+
+  $ sudo mkfs.ext4 -q -F /dev/pmem0
+  $ sudo mkdir -p /mnt/pmem0
+  $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+  $ mount -l | grep ^/dev/pmem0
+  /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge pages
+-----------------------------
+Transparent huge pages are generally not considered suitable for database
+workloads, but they improve PMEM performance by reducing page-walk overhead.
+
+  $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+  -rw-r--r-- 1 root root 4096 Dec  3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+  $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+  $ cat /sys/kernel/mm/transparent_hugepage/enabled
+  [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+  -P, --nvwal-path=FILE  path to file for non-volatile WAL buffer (NVWAL)
+  -Q, --nvwal-size=SIZE  size of NVWAL, in megabytes
+
+If you want to create a new 80GB (81920MB) NVWAL file on /mnt/pmem0/pgsql/nvwal,
+please run initdb as follows:
+
+  $ sudo mkdir -p /mnt/pmem0/pgsql
+  $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+  $ export PGDATA="$HOME/pgdata"
+  $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is okay; your NVWAL file holds the content of the first WAL
+segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL segment
+  size. The segment size is given with initdb --wal-segsize and defaults to
+  16MB.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+  which the NVWAL file is being created (/mnt/pmem0/pgsql for example
+  above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+  exists on the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+  not on PMEM or you forget "-o dax" option on mount.
+* Resizing an NVWAL file is NOT supported yet. Please choose the size of your
+  NVWAL file carefully.
+* "-Q 1024" (1024MB) will be assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+We have two new parameters, nvwal_path and nvwal_size, corresponding to the two
+new options of initdb. If you run initdb as above, you will find the following
+settings in postgresql.conf in your PGDATA directory:
+
+  max_wal_size = 80GB
+  min_wal_size = 80GB
+  nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+  nvwal_size = 80GB
+
+NOTE:
+* postgres will fail in startup if no file exists on the given nvwal_path.
+* postgres will fail in startup if the given nvwal_size is not equal to the
+  actual NVWAL file size.
+* postgres will fail in startup if the given nvwal_path is not on PMEM or you
+  forget "-o dax" option on mount.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres might run even if the three values are not the same,
+  but we have not tested such a case yet.
+
+
+Startup
+-------
+Start the server as usual:
+
+  $ pg_ctl start
+
+or use numactl as follows to let postgres run on the specified NUMA node
+(typically the one on which your NVWAL file is) if you need stable performance:
+
+  $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
-- 
2.17.1
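
The transparent huge page step in the README above verifies the setting by reading the sysfs file, where the active mode is the bracketed word (for example "[always] madvise never"). Below is a minimal sketch of extracting that word from one such line, assuming the format shown in the README. thp_active_mode() is a hypothetical helper and not part of the patchset.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper: copy the bracketed (active) mode out of a line read
 * from /sys/kernel/mm/transparent_hugepage/enabled, for example
 * "[always] madvise never". Returns 0 on success, or -1 if no bracketed
 * word is found or the output buffer is too small.
 */
int
thp_active_mode(const char *line, char *out, size_t outlen)
{
	const char *lb;
	const char *rb;
	size_t		len;

	assert(line != NULL && out != NULL);

	lb = strchr(line, '[');
	if (lb == NULL)
		return -1;
	rb = strchr(lb + 1, ']');
	if (rb == NULL)
		return -1;

	/* Length of the word between the brackets, excluding the brackets. */
	len = (size_t) (rb - lb - 1);
	if (len + 1 > outlen)
		return -1;

	memcpy(out, lb + 1, len);
	out[len] = '\0';
	return 0;
}
```

A caller would read the first line of the sysfs file into a buffer, pass it to thp_active_mode(), and compare the result with "always" before starting postgres.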

#21Deng, Gang
gang.deng@intel.com
In reply to: Takashi Menjo (#20)
RE: [PoC] Non-volatile WAL buffer

Hi Takashi,

Thank you for the patch and for your work on accelerating PG performance with NVM. I applied patch v4 and ran some performance tests on it. I stored the database data files on an NVMe SSD and the WAL file on Intel PMem (NVM). I used two methods to store the WAL file(s):

1. Leverage your patch to access PMem with libpmem (NVWAL patch).

2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no PG patch is required to access PMem (Storage over App Direct, SoAD).

I tried two insert scenarios:

A. Insert small records (each inserted record is 24 bytes long); I think this is similar to your test.

B. Insert large records (each inserted record is 328 bytes long).

My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. However, I observed that the NVWAL patch method gave a ~5% performance improvement over the Storage over App Direct method in scenario A, but a ~20% performance degradation in scenario B.

I investigated further. I found that the NVWAL patch can improve the performance of the XLogFlush function, but it may hurt the performance of the CopyXLogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM. Here are the key data from my test:

Scenario A (length of record to be inserted: 24 bytes per record):
====================================================
                                       NVWAL    SoAD
------------------------------------ ------- -------
Throughput (10^3 TPS)                  310.5   296.0
CPU Time % of CopyXLogRecordToWAL        0.4     0.2
CPU Time % of XLogInsertRecord           1.5     0.8
CPU Time % of XLogFlush                  2.1     9.6

Scenario B (length of record to be inserted: 328 bytes per record):
====================================================
                                       NVWAL    SoAD
------------------------------------ ------- -------
Throughput (10^3 TPS)                   13.0    16.9
CPU Time % of CopyXLogRecordToWAL        3.0     1.6
CPU Time % of XLogInsertRecord          23.0    16.4
CPU Time % of XLogFlush                  2.3     5.9
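
For readers who want to turn the throughput rows above into relative changes, here is a small sketch of the arithmetic. It assumes the SoAD column is taken as the baseline; percent_change() is a hypothetical helper name used only for illustration.

```c
#include <assert.h>

/*
 * Relative change of `value` against `baseline`, in percent.
 * A positive result means `value` is higher than `baseline`.
 */
double
percent_change(double baseline, double value)
{
	assert(baseline != 0.0);
	return (value - baseline) / baseline * 100.0;
}
```

With the numbers above, percent_change(296.0, 310.5) is about +4.9 for scenario A and percent_change(16.9, 13.0) is about -23.1 for scenario B, in line with the rough "~5%" and "~20%" figures.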

Best Regards,
Gang

From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.

On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:
Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3. It can now be used in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly into the non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto the non-volatile WAL buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote' <amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master. A new v2 patchset is attached to this mail.

I also measured performance before and after applying the patchset, varying the -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts. Conditions, steps, and other details are shown below.

Results (s=50)
==============
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)

Results (s=1000)
================
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)

Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).

The improvement in the s=1000 case looks larger than in the s=50 case. I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times, then took the median of the three as the final result shown in the tables above.

(1) Run initdb with proper -D and -X options; and also give --nvwal-path and --nvwal-size options after patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option to use the built-in "TPC-B (sort-of)" query.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas <hlinnaka@iki.fi>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches themselves, but let me point out that it's better to base these patches on HEAD (master branch) than REL_12_0, because all new code is committed to the master branch, whereas stable branches such as REL_12_0 only receive bug fixes. Do you have any specific reason to be working on REL_12_0?

Yes, because I think it's human-friendly to reproduce and discuss performance measurement. Of course I know all new accepted patches are merged into master's HEAD, not stable branches and not even release tags, so I'm aware of rebasing my patchset onto master sooner or later. However, if someone, including me, says that s/he applies my patchset to "master" and measures its performance, we have to pay attention to which commit the "master" really points to. Although we have sha1 hashes to specify which commit, we should check whether the specific commit on master has patches affecting performance or not, because master's HEAD gets new patches day by day. On the other hand, a release tag clearly points to a commit we all probably know. Also, we can check the features and improvements more easily by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest stable release's branch, that's normally just one of the baselines. The more important baseline for ongoing development is the master branch's HEAD, which is also what people volunteering to test your patches would use. Anyone who reports would have to give at least two numbers -- performance with a branch's HEAD without patch applied and that with patch applied -- which can be enough in most cases to see the difference the patch makes. Sure, the numbers might change on each report, but that's fine I'd think. If you continue to develop against the stable branch, you might fail to notice the impact of any relevant developments in the master branch, even developments which possibly require rethinking the architecture of your own changes, although maybe that rarely occurs.

Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com>

#22Takashi Menjo
takashi.menjo@gmail.com
In reply to: Deng, Gang (#21)
Re: [PoC] Non-volatile WAL buffer

Hello Gang,

Thank you for your report. I have not yet looked closely at record size,
so your report is very interesting. I will run a test like yours and
post the results here.

Regards,
Takashi

On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.deng@intel.com> wrote:

--
Takashi Menjo <takashi.menjo@gmail.com>

#23Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Takashi Menjo (#22)
2 attachment(s)
RE: [PoC] Non-volatile WAL buffer

Hi Gang,

I have tried but have not yet been able to reproduce the performance degradation you reported when inserting 328-byte records. So I think your conditions and mine differ somewhere, such as in the steps to reproduce, postgresql.conf, installation setup, and so on.

My results and conditions are as follows. Could you share your conditions in more detail? Note that I refer to your "Storage over App Direct" as "Original (PMEM)" and to your "NVWAL patch" as "Non-volatile WAL buffer."

Best regards,
Takashi

# Results
See the attached figure. In short, Non-volatile WAL buffer achieved better performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two separate NUMA nodes. The PMEM and the PCIe SSD for the server process are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change the length of the "filler" column of the "pgbench_history" table to 300 characters (ALTER TABLE pgbench_history ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
15) Run pg_prewarm for all the four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions
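
The NUMA pinning in steps 08, 14, and 16 can be sketched as a small Python helper that only builds the command lines; it does not run them. The function names and the (c, j) pairs in the demo are illustrative assumptions, since the steps leave -c and -j as "__". The numactl, pg_ctl, and pgbench options are copied from the steps above. Note that the plan below repeats only step 16, whereas the measurement repeats the whole procedure.

```python
import shlex


def pin(node, *argv):
    """Prefix a command with numactl so its CPUs and memory stay on one NUMA node."""
    return ["numactl", "-N", str(node), "-m", str(node), "--", *argv]


def server_start_cmd(log="pg.log"):
    # Steps 08 and 14: the postgres server runs on NUMA node 0.
    return pin(0, "pg_ctl", "-l", log, "start")


def pgbench_cmd(clients, jobs, seconds=1800):
    # Step 16: pgbench runs on NUMA node 1 with the default tpcb-like script.
    return pin(1, "pgbench", "-r", "-M", "prepared",
               "-T", str(seconds), "-c", str(clients), "-j", str(jobs))


def plan(pairs, repeats=3):
    """Yield (clients, jobs, run_no, shell_command) for every pgbench run."""
    for clients, jobs in pairs:
        for run_no in range(1, repeats + 1):
            yield clients, jobs, run_no, shlex.join(pgbench_cmd(clients, jobs))


if __name__ == "__main__":
    print(shlex.join(server_start_cmd()))
    # Placeholder (c, j) pairs; substitute the pairs you actually measure.
    for c, j, n, cmd in plan([(8, 8), (18, 18)]):
        print(f"c={c} j={j} run={n}: {cmd}")
```

Keeping the command construction separate from execution makes it easy to log exactly what was run for each (c, j) pair.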

I repeated all the steps three times for each (c,j), then took the median "tps = __ (including connections establishing)" of the three runs as the throughput and the "latency average = __ ms" of that same run as the average latency.
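
A minimal sketch of that selection rule in Python, assuming the pgbench summary of each run has been captured as text. The function names are illustrative, and the two patterns match only the output lines quoted above. It assumes an odd number of runs, so that the median is a single actual run.

```python
import re

TPS_RE = re.compile(r"^tps = ([0-9.]+) \(including connections establishing\)", re.M)
LAT_RE = re.compile(r"^latency average = ([0-9.]+) ms", re.M)


def parse_run(output):
    """Extract (tps, latency_ms) from the summary pgbench prints after a run."""
    tps = float(TPS_RE.search(output).group(1))
    latency_ms = float(LAT_RE.search(output).group(1))
    return tps, latency_ms


def median_run(outputs):
    """Return (tps, latency_ms) of the run whose tps is the median of all runs."""
    runs = sorted(parse_run(text) for text in outputs)
    return runs[len(runs) // 2]


if __name__ == "__main__":
    # Synthetic text shaped like pgbench's summary; the numbers are not measurements.
    samples = [
        "latency average = 0.260 ms\ntps = 69000.0 (including connections establishing)\n",
        "latency average = 0.250 ms\ntps = 72000.0 (including connections establishing)\n",
        "latency average = 0.254 ms\ntps = 71000.0 (including connections establishing)\n",
    ]
    print(median_run(samples))
```

Reporting the latency of the median-throughput run, rather than a separately computed median latency, keeps the two numbers tied to one run.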

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9
- PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
- PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + non-volatile WAL buffer patchset v4

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 24, 2020 2:38 AM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello Gang,

Thank you for your report. I have not yet looked closely at record size, so your report is very interesting. I will
also run a test like yours and post the results here.

Regards,
Takashi

On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.deng@intel.com> wrote:

Hi Takashi,

Thank you for the patch and for your work on accelerating PG performance with NVM. I applied the patch and ran some performance tests based on patch v4. I stored the database data files on an NVMe SSD and the WAL files on Intel PMem (NVM). I used two methods to store the WAL file(s):

1. Leverage your patch to access PMem with libpmem (NVWAL patch).

2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no PG patch is required to access PMem (Storage over App Direct).

I tried two insert scenarios:

A. Insert small records (each inserted record is 24 bytes long); I think this is similar to your test.

B. Insert large records (each inserted record is 328 bytes long).

My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL. But I observed that the NVWAL patch method had a ~5% performance improvement compared with the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in scenario B.

I investigated the test further. I found that the NVWAL patch can improve the performance of the XLogFlush function, but it may hurt the performance of the CopyXLogRecordToWAL function. This may be related to the higher latency of memcpy to Intel PMem compared with DRAM. Here are the key data from my test:

Scenario A (length of record to be inserted: 24 bytes per record):
==============================
                                       NVWAL    SoAD
------------------------------------ ------- -------
Throughput (10^3 TPS)                  310.5   296.0
CPU Time % of CopyXLogRecordToWAL        0.4     0.2
CPU Time % of XLogInsertRecord           1.5     0.8
CPU Time % of XLogFlush                  2.1     9.6

Scenario B (length of record to be inserted: 328 bytes per record):
==============================
                                       NVWAL    SoAD
------------------------------------ ------- -------
Throughput (10^3 TPS)                   13.0    16.9
CPU Time % of CopyXLogRecordToWAL        3.0     1.6
CPU Time % of XLogInsertRecord          23.0    16.4
CPU Time % of XLogFlush                  2.3     5.9
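
The "~5%" and "~20%" figures can be recomputed from the throughput rows. The sketch below uses an illustrative helper name and takes SoAD as the baseline, which is an assumption about how the percentages were derived. It gives about +4.9% for scenario A and about -23.1% for scenario B, and it also prints how many percentage points of CPU time each function gained or lost under NVWAL.

```python
def relative_change(new, base):
    """Percentage change of `new` relative to `base`."""
    return (new - base) / base * 100.0


# Figures copied from the two tables above, as (NVWAL, SoAD).
throughput = {"A": (310.5, 296.0), "B": (13.0, 16.9)}
cpu_share = {
    "A": {"CopyXLogRecordToWAL": (0.4, 0.2),
          "XLogInsertRecord": (1.5, 0.8),
          "XLogFlush": (2.1, 9.6)},
    "B": {"CopyXLogRecordToWAL": (3.0, 1.6),
          "XLogInsertRecord": (23.0, 16.4),
          "XLogFlush": (2.3, 5.9)},
}

if __name__ == "__main__":
    for scenario in ("A", "B"):
        nvwal, soad = throughput[scenario]
        print(f"Scenario {scenario}: throughput {relative_change(nvwal, soad):+.1f}% vs SoAD")
        for func, (n, s) in cpu_share[scenario].items():
            # Difference in percentage points of CPU time, NVWAL minus SoAD.
            print(f"  {func}: {n - s:+.1f} points")
```

In scenario B the time saved in XLogFlush (3.6 points) is smaller than the time added in XLogInsertRecord (6.6 points), which is at least consistent with the memcpy explanation given above, though the CPU-time shares alone do not prove it.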

Best Regards,

Gang

From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.

On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3. It can now be used in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL buffer if you run it in "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The path will be written to
postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the master's.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote' <amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer patchset onto master. A new v2 patchset is attached to this mail.

I also measured performance before and after the patchset, varying the -c/--client and -j/--jobs options of pgbench, for each scaling factor s = 50 or 1000. The results are presented in the following tables and the attached charts.

Conditions, steps, and other details are shown later in this mail.

Results (s=50)
==============
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)

Results (s=1000)
================
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)

Both throughput and average latency are improved for each scaling factor. Throughput seemed to almost reach the upper limit when (c,j)=(36,18).

The percentage in the s=1000 case looks larger than in the s=50 case. I think a larger scaling factor leads to less contention on the same tables and/or indexes, that is, fewer lock and unlock operations. In such a situation, write-ahead logging appears to be more significant for performance.
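
As a consistency check on the two result tables (not part of the original report): without rate limiting, each pgbench client keeps one transaction in flight, so throughput should be close to c divided by the average latency. The sketch below, with illustrative helper names, recomputes throughput from latency for every row. It also shows why latency keeps growing once throughput has plateaued at (36,18): extra clients only wait longer.

```python
# Rows copied from the tables above:
# (c, throughput before, throughput after [10^3 TPS], latency before, latency after [ms])
S50 = [
    (8, 35.7, 37.1, 0.224, 0.216),
    (18, 70.9, 74.7, 0.254, 0.241),
    (36, 76.0, 80.8, 0.473, 0.446),
    (54, 75.5, 81.8, 0.715, 0.660),
]
S1000 = [
    (8, 37.4, 40.1, 0.214, 0.199),
    (18, 79.3, 86.7, 0.227, 0.208),
    (36, 87.2, 95.5, 0.413, 0.377),
    (54, 86.8, 94.8, 0.622, 0.569),
]


def implied_ktps(clients, latency_ms):
    """Throughput in 10^3 TPS implied by `clients` back-to-back connections."""
    # clients / (latency in ms) is transactions per millisecond, i.e. 10^3 TPS.
    return clients / latency_ms


def max_deviation(rows):
    """Largest relative gap between reported and implied throughput."""
    gaps = []
    for c, tps_before, tps_after, lat_before, lat_after in rows:
        gaps.append(abs(implied_ktps(c, lat_before) - tps_before) / tps_before)
        gaps.append(abs(implied_ktps(c, lat_after) - tps_after) / tps_after)
    return max(gaps)


if __name__ == "__main__":
    for name, rows in (("s=50", S50), ("s=1000", S1000)):
        print(f"{name}: max deviation {max_deviation(rows) * 100:.2f}%")
```

Every row agrees to well under 1%, so the throughput and latency columns carry essentially the same information for a fixed c.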

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times and took the median of the three as the final result shown in the tables above.

(1) Run initdb with proper -D and -X options; also give the --nvwal-path and --nvwal-size options after the patch
(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench for 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option to use the built-in "TPC-B (sort-of)" query.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas <hlinnaka@iki.fi>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as REL_12_0 only receive bug fixes. Do you have any
specific reason to be working on REL_12_0?

Yes, because I think it's human-friendly to reproduce and discuss
performance measurement. Of course I know all new accepted patches
are merged into master's HEAD, not stable branches and not even
release tags, so I'm aware of rebasing my patchset onto master sooner
or later. However, if someone, including me, says that s/he applies
my patchset to "master" and measures its performance, we have to pay
attention to which commit the "master" really points to. Although we
have sha1 hashes to specify which commit, we should check whether the
specific commit on master has patches affecting performance or not,
because master's HEAD gets new patches day by day. On the other hand,
a release tag clearly points to a commit we all probably know. Also,
we can check the features and improvements more easily by using
release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to
notice the impact of relevant developments in the master branch,
even developments which possibly require rethinking the architecture
of your own changes, although maybe that rarely occurs.

Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

performance-s50-filler300.png (image/png)
postgresql.conf (application/octet-stream)
#24Deng, Gang
gang.deng@intel.com
In reply to: Takashi Menjo (#23)
2 attachment(s)
RE: [PoC] Non-volatile WAL buffer

Hi Takashi,

There are some differences between our HW/SW configurations and test steps. I have attached the postgresql.conf I used for your reference. I would like to try the postgresql.conf and the steps you provided in the coming days to see if I can find the cause.

I also ran pgbench and the postgres server on the same machine but on different NUMA nodes, and ensured that the server process and the PMEM were on the same NUMA node. I used steps similar to yours from step 1 to 9, but the later steps differ. The major differences are:

In step 10), I created a database and table for test by:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text default '75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d79a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to warm up.
In step 16), I ran pgbench with the command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _ insert_bench (test.sql is attached).

For HW/SW conf, the major differences are:
CPU: I used Xeon 8268 (24c@2.9GHz, HT enabled)
OS Distro: CentOS 8.2.2004
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards
Gang

Attachments:

postgresql.conf (application/octet-stream)
test.sql (application/octet-stream)
#25Takashi Menjo
takashi.menjou.vg@hco.ntt.co.jp
In reply to: Deng, Gang (#24)
RE: [PoC] Non-volatile WAL buffer

Hi Gang,

Thanks. I have tried to reproduce the performance degradation using your configuration, query, and steps. Today I got some results in which Original (PMEM) achieved better performance than Non-volatile WAL buffer in my Ubuntu environment. I am now investigating further.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Deng, Gang <gang.deng@intel.com>
Sent: Friday, October 9, 2020 3:10 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org; 'Takashi Menjo' <takashi.menjo@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Takashi,

There are some differences between our HW/SW configurations and test steps. I attached the postgresql.conf I used
for your reference. I would like to try the postgresql.conf and steps you provided in the coming days to see if I can find
the cause.

I also ran pgbench and the postgres server on the same machine but on different NUMA nodes, and ensured that the server process
and PMEM were on the same NUMA node. I used steps similar to yours from step 1 to 9, but there are some differences in the later
steps. The major ones are:

In step 10), I created a database and table for test by:
#create database:
psql -c "create database insert_bench;"
#create table:
psql -d insert_bench -c "create table test(crt_time timestamp, info text default
'75feba6d5ca9ff65d09af35a67fe962a4e3fa5ef279f94df6696bee65f4529a4bbb03ae56c3b5b86c22b447fc
48da894740ed1a9d518a9646b3a751a57acaca1142ccfc945b1082b40043e3f83f8b7605b5a55fcd7eb8fc1
d0475c7fe465477da47d96957849327731ae76322f440d167725d2e2bbb60313150a4f69d9a8c9e86f9d7
9a742e7a35bf159f670e54413fb89ff81b8e5e8ab215c3ddfd00bb6aeb4');"

In step 15), I did not use pg_prewarm, but just ran pgbench for 180 seconds to warm up.
In step 16), I ran pgbench using command: pgbench -M prepared -n -r -P 10 -f ./test.sql -T 600 -c _ -j _
insert_bench. (test.sql can be found in attachment)

For HW/SW conf, the major differences are:
CPU: I used Xeon 8268 (24c@2.9Ghz, HT enabled)
OS Distro: CentOS 8.2.2004
Kernel: 4.18.0-193.6.3.el8_2.x86_64
GCC: 8.3.1

Best regards
Gang

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Tuesday, October 6, 2020 4:49 PM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; 'Takashi Menjo' <takashi.menjo@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Hi Gang,

I have tried but cannot yet reproduce the performance degradation you reported when inserting 328-byte records. So
I think your conditions and mine differ in some way, such as the steps to reproduce, postgresql.conf, installation
setup, and so on.

My results and conditions are as follows. May I have your conditions in more detail? Note that I refer to your "Storage
over App Direct" as "Original (PMEM)" and to your "NVWAL patch" as "Non-volatile WAL buffer."

Best regards,
Takashi

# Results
See the attached figure. In short, Non-volatile WAL buffer got better performance than Original (PMEM).

# Steps
Note that I ran the postgres server and pgbench on a single machine but on two separate NUMA nodes. The PMEM
and the PCIe SSD for the server process are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option (sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo
mount -o dax /dev/pmem0 /mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount
/dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of Non-volatile WAL buffer
07) Edit postgresql.conf as the attached one
- Please remove nvwal_* lines in the case of Original (PMEM)
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Change # characters of "filler" column of "pgbench_history" table to 300 (ALTER TABLE pgbench_history
ALTER filler TYPE character(300);)
- This would make the row size of the table 328 bytes
12) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
13) Remount the PMEM and the PCIe SSD
14) Start postgres server process on NUMA node 0 again (numactl -N 0 -m 0 -- pg_ctl -l pg.log start)
15) Run pg_prewarm for all the four pgbench_* tables
16) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 -- pgbench -r -M prepared -T 1800 -c __
-j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j) then got the median "tps = __ (including connections
establishing)" of the three as throughput and the "latency average = __ ms " of that time as average latency.
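
The selection rule described above (take the run with the median throughput, and report the latency of that same run) could be sketched as follows. This is only an illustration, not the script actually used for the measurements; the function name and the sample numbers are hypothetical.

```python
def pick_median_run(runs):
    """Return the (tps, latency_ms) pair of the run with the median tps.

    `runs` is a list of (tps, latency_ms) tuples, one per repetition.
    An odd number of runs is assumed, as in the three-run procedure.
    """
    ordered = sorted(runs, key=lambda run: run[0])
    return ordered[len(ordered) // 2]


# Hypothetical numbers for illustration only, not measurements from this thread.
runs = [(35100.0, 0.228), (36200.0, 0.221), (35700.0, 0.224)]
tps, latency_ms = pick_median_run(runs)
print(tps, latency_ms)
```

Reporting the latency of the median-throughput run, rather than the median of the latencies, keeps the two reported numbers tied to a single actual run.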

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6 channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket x2 sockets (256 GiB per channel
x 6 channels per socket; interleaving enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7 (vanilla)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9
- PostgreSQL (Original): 14devel (200f610: Jul 26, 2020)
- PostgreSQL (Non-volatile WAL buffer): 14devel (200f610: Jul 26, 2020) + non-volatile WAL buffer patchset
v4

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 24, 2020 2:38 AM
To: Deng, Gang <gang.deng@intel.com>
Cc: pgsql-hackers@postgresql.org; Takashi Menjo
<takashi.menjou.vg@hco.ntt.co.jp>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello Gang,

Thank you for your report. I have not looked deeply into the effect of record size
yet, so your report is very interesting. I will also run a test like yours and then post the results here.

Regards,
Takashi

On Mon, Sep 21, 2020 at 14:14, Deng, Gang <gang.deng@intel.com> wrote:

Hi Takashi,

Thank you for the patch and your work on accelerating PG performance with
NVM. I applied the patch and ran some performance tests based on
patch v4. I stored the database data files on an NVMe SSD and the WAL files on Intel PMem (NVM). I used two
methods to store the WAL file(s):

1. Leverage your patch to access PMem with libpmem (NVWAL patch).

2. Access PMem through the legacy filesystem interface, that is, use PMem as an ordinary block device; no
PG patch is required to access PMem (Storage over App Direct).

I tried two insert scenarios:

A. Insert small records (the length of a record to be inserted is 24 bytes); I think this is similar to your test.

B. Insert large records (the length of a record to be inserted is 328 bytes).

My original purpose was to see a higher performance gain in scenario B, as it is more write-intensive on WAL.
But I observed that the NVWAL patch method had a ~5% performance improvement
compared with the Storage over App Direct method in scenario A, while it had a ~20% performance degradation in
scenario B.

I investigated the test further. I found that the NVWAL patch
can improve the performance of the XLogFlush function, but it may hurt
the performance of the CopyXLogRecordToWAL function. This may be related to the higher latency of memcpy to Intel
PMem compared with DRAM. Here are the key data from my test:

Scenario A (length of record to be inserted: 24 bytes per record):
==============================
                                       NVWAL    SoAD
------------------------------------ ------- -------
Throughput (10^3 TPS)                  310.5   296.0
CPU Time % of CopyXLogRecordToWAL        0.4     0.2
CPU Time % of XLogInsertRecord           1.5     0.8
CPU Time % of XLogFlush                  2.1     9.6

Scenario B (length of record to be inserted: 328 bytes per record):
==============================
                                       NVWAL    SoAD
------------------------------------ ------- -------
Throughput (10^3 TPS)                   13.0    16.9
CPU Time % of CopyXLogRecordToWAL        3.0     1.6
CPU Time % of XLogInsertRecord          23.0    16.4
CPU Time % of XLogFlush                  2.3     5.9
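
For reference, the relative throughput change implied by the two tables can be recomputed as below. This is a minimal sketch using only the throughput figures above; the helper name is an assumption made for illustration.

```python
def relative_change_percent(patched, baseline):
    """Relative change of `patched` versus `baseline`, in percent."""
    return (patched - baseline) / baseline * 100.0


# Throughput in 10^3 TPS, taken from the tables above (NVWAL vs. SoAD).
scenario_a = relative_change_percent(310.5, 296.0)
scenario_b = relative_change_percent(13.0, 16.9)

print(f"Scenario A: {scenario_a:+.1f}%")
print(f"Scenario B: {scenario_b:+.1f}%")
```

With these inputs the change is about +4.9% for scenario A and about -23.1% for scenario B, which matches the "~5%" figure and is slightly worse than the rounded "~20%" figure quoted in the text.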

Best Regards,

Gang

From: Takashi Menjo <takashi.menjo@gmail.com>
Sent: Thursday, September 10, 2020 4:01 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: pgsql-hackers@postgresql.org
Subject: Re: [PoC] Non-volatile WAL buffer

Rebased.

On Wed, Jun 24, 2020 at 16:44, Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Dear hackers,

I have updated my non-volatile WAL buffer patchset to v3. Now we can
use it in streaming replication mode.

Updates from v2:

- walreceiver supports non-volatile WAL buffer
Now walreceiver stores received records directly to non-volatile WAL buffer if applicable.

- pg_basebackup supports non-volatile WAL buffer
Now pg_basebackup copies received WAL segments onto non-volatile WAL
buffer if you run it with "nvwal" mode (-Fn).
You should specify a new NVWAL path with the --nvwal-path option. The
path will be written to postgresql.auto.conf or recovery.conf. The size of the new NVWAL is the same as the
master's.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Wednesday, March 18, 2020 5:59 PM
To: 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'Amit Langote' <amitlangote09@gmail.com>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear hackers,

I rebased my non-volatile WAL buffer's patchset onto master. A
new v2 patchset is attached to this mail.

I also measured performance before and after applying the patchset,
varying the -c/--client and -j/--jobs options of pgbench, for each
scaling factor s = 50 or 1000. The results are presented in the
following tables and the attached charts.

Conditions, steps, and other details will be shown later.

Results (s=50)
==============
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  35.7    37.1 (+3.9%)   0.224   0.216 (-3.6%)
(18,18)  70.9    74.7 (+5.3%)   0.254   0.241 (-5.1%)
(36,18)  76.0    80.8 (+6.3%)   0.473   0.446 (-5.7%)
(54,18)  75.5    81.8 (+8.3%)   0.715   0.660 (-7.7%)

Results (s=1000)
================
         Throughput [10^3 TPS]  Average latency [ms]
( c, j)  before  after          before  after
-------  ---------------------  ---------------------
( 8, 8)  37.4    40.1 (+7.3%)   0.214   0.199 (-7.0%)
(18,18)  79.3    86.7 (+9.3%)   0.227   0.208 (-8.4%)
(36,18)  87.2    95.5 (+9.5%)   0.413   0.377 (-8.7%)
(54,18)  86.8    94.8 (+9.3%)   0.622   0.569 (-8.5%)

Both throughput and average latency are improved for each scaling
factor. Throughput seemed to almost reach the upper limit when
(c,j)=(36,18).

The percentage in the s=1000 case looks larger than in the s=50 case. I
think a larger scaling factor leads to less contention on the same
tables and/or indexes, that is, fewer lock and unlock operations. In
such a situation, write-ahead logging appears to be more significant
for performance.
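
As a side note on reading the tables, the reported average latencies are consistent with the simple relation latency = clients / throughput. The sketch below checks this against the "before" columns for s=50; the function name is an assumption, and the relation is offered as an observation about these numbers rather than a statement about how the figures were produced.

```python
def expected_latency_ms(clients, throughput_ktps):
    """Average latency in ms if `clients` connections are always busy.

    throughput_ktps is in 10^3 transactions per second.
    """
    transactions_per_second = throughput_ktps * 1000.0
    return clients / transactions_per_second * 1000.0


# (clients, throughput [10^3 TPS], reported latency [ms]) from "before", s=50.
rows = [(8, 35.7, 0.224), (18, 70.9, 0.254), (36, 76.0, 0.473), (54, 75.5, 0.715)]

for clients, ktps, reported_ms in rows:
    estimate = expected_latency_ms(clients, ktps)
    print(f"c={clients:2d}  estimated={estimate:.3f} ms  reported={reported_ms:.3f} ms")
```

Because throughput saturates beyond (36,18), adding clients mostly adds latency, which is what the last two rows show.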

Conditions
==========
- Use one physical server having 2 NUMA nodes (node 0 and 1)
- Pin postgres (server processes) to node 0 and pgbench to node 1
- 18 cores and 192GiB DRAM per node
- Use an NVMe SSD for PGDATA and an interleaved 6-in-1 NVDIMM-N set for pg_wal
- Both are installed on the server-side node, that is, node 0
- Both are formatted with ext4
- NVDIMM-N is mounted with "-o dax" option to enable Direct Access (DAX)
- Use the attached postgresql.conf
- Two new items nvwal_path and nvwal_size are used only after patch

Steps
=====
For each (c,j) pair, I did the following steps three times then I
found the median of the three as a final result shown in the tables
above.

(1) Run initdb with proper -D and -X options; and also give
--nvwal-path and --nvwal-size options after patch

(2) Start postgres and create a database for pgbench tables
(3) Run "pgbench -i -s ___" to create tables (s = 50 or 1000)
(4) Stop postgres, remount filesystems, and start postgres again
(5) Execute pg_prewarm extension for all the four pgbench tables
(6) Run pgbench during 30 minutes

pgbench command line
====================
$ pgbench -h /tmp -p 5432 -U username -r -M prepared -T 1800 -c ___ -j ___ dbname

I gave no -b option to use the built-in "TPC-B (sort-of)" query.

Software
========
- Distro: Ubuntu 18.04
- Kernel: Linux 5.4 (vanilla kernel)
- C Compiler: gcc 7.4.0
- PMDK: 1.7
- PostgreSQL: d677550 (master on Mar 3, 2020)

Hardware
========
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6154 (Skylake) x 2sockets
- DRAM: DDR4 2666MHz {32GiB/ch x 6ch}/socket x 2sockets
- NVDIMM-N: DDR4 2666MHz {16GiB/ch x 6ch}/socket x 2sockets
- NVMe SSD: Intel Optane DC P4800X Series SSDPED1K750GA

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Sent: Thursday, February 20, 2020 6:30 PM
To: 'Amit Langote' <amitlangote09@gmail.com>
Cc: 'Robert Haas' <robertmhaas@gmail.com>; 'Heikki Linnakangas' <hlinnaka@iki.fi>; 'PostgreSQL-development' <pgsql-hackers@postgresql.org>
Subject: RE: [PoC] Non-volatile WAL buffer

Dear Amit,

Thank you for your advice. Exactly, it's so to speak "do as the hackers do when in pgsql"...

I'm rebasing my branch onto master. I'll submit an updated
patchset and performance report later.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
NTT Software Innovation Center

-----Original Message-----
From: Amit Langote <amitlangote09@gmail.com>
Sent: Monday, February 17, 2020 5:21 PM
To: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Cc: Robert Haas <robertmhaas@gmail.com>; Heikki Linnakangas <hlinnaka@iki.fi>; PostgreSQL-development <pgsql-hackers@postgresql.org>
Subject: Re: [PoC] Non-volatile WAL buffer

Hello,

On Mon, Feb 17, 2020 at 4:16 PM Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp> wrote:

Hello Amit,

I apologize for not having any opinion on the patches
themselves, but let me point out that it's better to base these
patches on HEAD (master branch) than REL_12_0, because all new
code is committed to the master branch, whereas stable branches
such as REL_12_0 only receive bug fixes. Do you have any
specific reason to be working on REL_12_0?

Yes, because I think it's human-friendly to reproduce and discuss
performance measurement. Of course I know all new accepted patches
are merged into master's HEAD, not stable branches and not even
release tags, so I'm aware that I will have to rebase my patchset onto
master sooner or later. However, if someone, including me, says that
s/he applies my patchset to "master" and measures its performance, we
have to pay attention to which commit the "master" really points to.
Although we have SHA-1 hashes to specify which commit, we should check
whether the specific commit on master has patches affecting
performance or not, because master's HEAD gets new patches day by day.
On the other hand, a release tag clearly points to a commit that we all
probably know. Also, we can check the features and improvements more
easily by using release notes and user manuals.

Thanks for clarifying. I see where you're coming from.

While I do sometimes see people reporting numbers with the latest
stable release's branch, that's normally just one of the baselines.
The more important baseline for ongoing development is the master
branch's HEAD, which is also what people volunteering to test your
patches would use. Anyone who reports would have to give at least
two numbers -- performance with a branch's HEAD without patch
applied and that with patch applied -- which can be enough in most
cases to see the difference the patch makes. Sure, the numbers
might change on each report, but that's fine I'd think. If you
continue to develop against the stable branch, you might fail to
notice impact from any relevant developments in the master branch,
even developments which possibly require rethinking the
architecture of your own changes, although maybe that
rarely occurs.

Thanks,
Amit

--
Takashi Menjo <takashi.menjo@gmail.com>

#26Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Takashi Menjo (#25)
Re: [PoC] Non-volatile WAL buffer

I had a new look at this thread today, trying to figure out where we
are. I'm a bit confused.

One thing we have established: mmap()ing WAL files performs worse than
the current method, if pg_wal is not on a persistent memory device. This
is because the kernel faults in existing content of each page, even
though we're overwriting everything.

That's unfortunate. I was hoping that mmap() would be a good option even
without persistent memory hardware. I wish we could tell the kernel to
zero the pages instead of reading them from the file. Maybe clear the
file with ftruncate() before mmapping it?

That should not be a problem with a real persistent memory device, however
(or when emulating it with DRAM). With DAX, the storage is memory-mapped
directly and there is no page cache, and no pre-faulting.

Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is stored
on the NVRAM. Why? I realize that this is just a Proof of Concept, but
I'm very much not interested in anything that requires the DBA to manage
a second WAL location. Did you test the mmap() patches with persistent
memory hardware? Did you compare that with the pmem patchset, on the
same hardware? If there's a meaningful performance difference between
the two, what's causing it?

- Heikki

#27Takashi Menjo
takashi.menjo@gmail.com
In reply to: Heikki Linnakangas (#26)
Re: [PoC] Non-volatile WAL buffer

Hi Heikki,

I had a new look at this thread today, trying to figure out where we
are. I'm a bit confused.

One thing we have established: mmap()ing WAL files performs worse than
the current method, if pg_wal is not on a persistent memory device. This
is because the kernel faults in existing content of each page, even
though we're overwriting everything.

Yes. In addition, after a certain page (in the sense of OS page) is
msync()ed, another page fault will occur again when something is stored
into that page.

That's unfortunate. I was hoping that mmap() would be a good option even
without persistent memory hardware. I wish we could tell the kernel to
zero the pages instead of reading them from the file. Maybe clear the
file with ftruncate() before mmapping it?

The area extended by ftruncate() appears as if it were zero-filled [1].
Please note that it merely "appears as if." It might not be actually
zero-filled as data blocks on devices, so pre-allocating files should
improve transaction performance. At least, on Linux 5.7 and ext4, it takes
more time to store into a mapped file that was just open(O_CREAT)ed and
ftruncate()d than into one that has already been actually filled.
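
The "appears as if" distinction can be observed from user space. The following is a minimal sketch, assuming a POSIX system and a temporary directory on an ordinary filesystem: it extends a new file with ftruncate(), confirms that reads return zeros, and prints how many bytes the filesystem reports as actually allocated. The 16 MiB size is the default WAL segment size; everything else is illustrative.

```python
import os
import tempfile

SIZE = 16 * 1024 * 1024  # 16 MiB, the default WAL segment size

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "segment")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, SIZE)                # extend the empty file to SIZE
        st = os.fstat(fd)
        first_bytes = os.pread(fd, 4096, 0)   # reads back as zeros
        logical_size = st.st_size
        allocated_bytes = st.st_blocks * 512  # st_blocks counts 512-byte units
    finally:
        os.close(fd)

print("logical size:   ", logical_size)
print("allocated bytes:", allocated_bytes)
print("reads as zeros: ", first_bytes == bytes(4096))
```

On filesystems that support sparse files, the allocated size is typically much smaller than the logical size, so the first store into each block still has to allocate it. The exact numbers depend on the filesystem and are not asserted here.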

That should not be a problem with a real persistent memory device, however
(or when emulating it with DRAM). With DAX, the storage is memory-mapped
directly and there is no page cache, and no pre-faulting.

Yes, with filesystem DAX, there is no page cache for file data. A page
fault still occurs but for each 2MiB DAX hugepage, so its overhead
decreases compared with 4KiB page fault. Such a DAX hugepage fault is only
applied to DAX-mapped files and is different from a general transparent
hugepage fault.
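
To give a sense of scale, the number of first-touch faults needed to cover one mapped region can be estimated as below. This is a back-of-the-envelope sketch that assumes exactly one fault per page on first touch; it ignores kernel optimizations such as fault-around and says nothing about the cost of an individual fault. The 16 MiB region is the default WAL segment size.

```python
KIB = 1024
MIB = 1024 * KIB


def faults_to_touch(region_bytes, page_bytes):
    """Upper bound on first-touch page faults, one fault per page."""
    return -(-region_bytes // page_bytes)  # ceiling division


segment = 16 * MIB
small_pages = faults_to_touch(segment, 4 * KIB)
huge_pages = faults_to_touch(segment, 2 * MIB)

print(small_pages, huge_pages, small_pages // huge_pages)
```

Under that assumption a 16 MiB region needs 4096 faults with 4KiB pages and 8 with 2MiB pages, a factor of 512 in fault count.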

Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is stored
on the NVRAM. Why? I realize that this is just a Proof of Concept, but
I'm very much not interested in anything that requires the DBA to manage
a second WAL location. Did you test the mmap() patches with persistent
memory hardware? Did you compare that with the pmem patchset, on the
same hardware? If there's a meaningful performance difference between
the two, what's causing it?

Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

The reason this patchset puts the buffers into a separate file, rather than
the existing segment files in PGDATA/pg_wal, is that doing so reduces the
overhead of system calls such as open(), mmap(), munmap(), and close(). It
open()s and mmap()s the file "nvwal_path" once, and keeps that file mapped
while running. In contrast, with the patchset that mmap()s the segment
files, a backend process has to munmap() and close() the currently mapped
file and open() and mmap() the new one each time its insert location
crosses a segment boundary. This causes the performance difference
between the two.

Best regards,
Takashi

[1]: https://pubs.opengroup.org/onlinepubs/9699919799/functions/ftruncate.html

--
Takashi Menjo <takashi.menjo@gmail.com>

#28Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#25)
Re: [PoC] Non-volatile WAL buffer

Hi Gang,

I appreciate your patience. I reproduced the results you reported to me in
my environment.

First of all, the condition you gave me was a little unstable in my
environment, so I made the values of {max_,min_,nv}wal_size larger and the
pre-warm duration longer to get stable performance. I didn't modify your
table, query, or benchmark duration.

Under the stable condition, Original (PMEM) still got better performance
than Non-volatile WAL Buffer. To sum up, the reason was that Non-volatile
WAL Buffer on Optane PMem spent much more time than Original (PMEM) in
XLogInsert when using your table and query. This offset the improvement in
XLogFlush and degraded performance overall. VTune told me that
Non-volatile WAL Buffer took more CPU time than Original (PMEM) for
(XLogInsert => XLogInsertRecord => CopyXLogRecordToWAL =>) memcpy while it
took less time for XLogFlush. This profile was very similar to the one you
reported.

In general, when WAL buffers are on Optane PMem rather than DRAM, it
takes more time to memcpy WAL records into the buffers
because Optane PMem is a little slower than DRAM. In return,
Non-volatile WAL Buffer reduces the time needed for the records to reach the
device, because it doesn't need to write them out of the buffers to somewhere
else; it only needs to flush them from CPU caches to the underlying
memory-mapped file.
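
The trade-off can be made concrete with the CPU-time percentages Gang reported earlier in the thread (NVWAL vs. Storage over App Direct). The sketch below adds the insert-side and flush-side shares for each method and compares them. It is only a rough indication: the percentages come from runs with different throughput, and summing the XLogInsertRecord and XLogFlush shares is an assumption made here for illustration.

```python
# CPU time % per function, taken from Gang's report earlier in this thread.
profiles = {
    "A (24-byte records)": {
        "NVWAL": {"XLogInsertRecord": 1.5, "XLogFlush": 2.1},
        "SoAD": {"XLogInsertRecord": 0.8, "XLogFlush": 9.6},
    },
    "B (328-byte records)": {
        "NVWAL": {"XLogInsertRecord": 23.0, "XLogFlush": 2.3},
        "SoAD": {"XLogInsertRecord": 16.4, "XLogFlush": 5.9},
    },
}


def net_wal_cpu_delta(profile):
    """NVWAL minus SoAD, in percentage points of CPU time (insert + flush)."""
    nvwal = sum(profile["NVWAL"].values())
    soad = sum(profile["SoAD"].values())
    return nvwal - soad


for scenario, profile in profiles.items():
    print(f"{scenario}: {net_wal_cpu_delta(profile):+.1f} points")
```

In scenario A the flush-side saving outweighs the extra insert cost (about -6.8 points), while in scenario B the extra insert cost outweighs the saving (about +3.0 points), which matches the direction of the throughput results.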

Your report shows that Non-volatile WAL Buffer on Optane PMem is not good
for certain kinds of transactions, and is good for others. I have tried
changing how WAL records are inserted and flushed, as well as configurations
and constants that could affect performance such as NUM_XLOGINSERT_LOCKS, but
Non-volatile WAL Buffer has not yet achieved better performance than Original
(PMEM) when using your table and query. I will continue to work on this
issue and will report if I have any update.

By the way, did the progress reported by pgbench with the -P
option drop to zero when you ran Non-volatile WAL Buffer? If so, your
{max_,min_,nv}wal_size might be too small or your checkpoint configuration
might not be appropriate. Could you check your results again?

Best regards,
Takashi

--
Takashi Menjo <takashi.menjo@gmail.com>

#29Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Takashi Menjo (#20)
3 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi,

These patches no longer apply :-( A rebased version would be nice.

I've been interested in what performance improvements this might bring,
so I've been running some extensive benchmarks on a machine with PMEM
hardware. So let me share some interesting results. (I used a commit from
early September, to make the patch apply cleanly.)

Note: The hardware was provided by Intel, and they are interested in
supporting the development and providing access to machines with PMEM to
developers. So if you're interested in this patch & PMEM, but don't have
access to suitable hardware, try contacting Steve Shaw
<steve.shaw@intel.com> who's the person responsible for open source
databases at Intel (he's also the author of HammerDB).

The benchmarks were done on a machine with 2 x Xeon Platinum (24/48
cores), 128GB RAM, NVMe and PMEM SSDs. I did some basic pgbench tests
with different scales (500, 5000, 15000) with and without these patches.
I did some usual tuning (shared buffers, max_wal_size etc.), the most
important changes being:

- maintenance_work_mem = 256MB
- max_connections = 200
- random_page_cost = 1.2
- shared_buffers = 16GB
- work_mem = 64MB
- checkpoint_completion_target = 0.9
- checkpoint_timeout = 20min
- max_wal_size = 96GB
- autovacuum_analyze_scale_factor = 0.1
- autovacuum_vacuum_insert_scale_factor = 0.05
- autovacuum_vacuum_scale_factor = 0.01
- vacuum_cost_limit = 1000

And on the patched version:

- nvwal_size = 128GB
- nvwal_path = … points to the PMEM DAX device …

The machine has multiple SSDs (all Optane-based, IIRC):

- NVMe SSD (Optane)
- PMEM in BTT mode
- PMEM in DAX mode

So I've tested all of them - the data was always on the NVMe device, and
the WAL was placed on one of those devices. That means we have these
four cases to compare:

- nvme - master with WAL on the NVMe SSD
- pmembtt - master with WAL on PMEM in BTT mode
- pmemdax - master with WAL on PMEM in DAX mode
- pmemdax-ntt - patched version with WAL on PMEM in DAX mode

The "nvme" is a bit disadvantaged as it places both data and WAL on the
same device, so consider that while evaluating the results. But for the
smaller data sets this should be fairly negligible, I believe.

I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
with WAL on PMEM DAX device) is actually safe, but I included it anyway
to see what the difference is.

Now let's look at results for the basic data sizes and client counts.
I've also attached some charts to illustrate this. These numbers are tps
averages from 3 runs, each about 30 minutes long.

1) scale 500 (fits into shared buffers)
---------------------------------------

wal                 1       16       32       64       96
----------------------------------------------------------
nvme             6321    73794   132687   185409   192228
pmembtt          6248    60105    85272    82943    84124
pmemdax          6686    86188   154850   105219   149224
pmemdax-ntt      8062   104887   211722   231085   252593

The NVMe performs well (the single device is not an issue, as there
should be very little non-WAL I/O). The PMEM/BTT has a clear bottleneck
at ~85k tps. It's interesting that PMEM/DAX performs much worse without the
patch, and that there is a drop at 64 clients. Not sure what that's about.

2) scale 5000 (fits into RAM)
-----------------------------

wal                 1       16       32       64       96
-----------------------------------------------------------
nvme             4804    43636    61443    79807    86414
pmembtt          4203    28354    37562    41562    43684
pmemdax          5580    62180    92361   112935   117261
pmemdax-ntt      6325    79887   128259   141793   127224

The differences are more significant, compared to the small scale. The
BTT seems to have a bottleneck around ~43k tps, and the PMEM/DAX dominates.

3) scale 15000 (bigger than RAM)
--------------------------------

wal                 1       16       32       64       96
-----------------------------------------------------------
pmembtt          3638    20630    28985    32019    31303
pmemdax          5164    48230    69822    85740    90452
pmemdax-ntt      5382    62359    80038    83779    80191

I have not included the nvme results here, because the impact of placing
both data and WAL on the same device was too significant IMHO.

The remaining results seem nice. It's interesting that the patched case is a
bit slower than master at 64 and 96 clients. Not sure why.
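
To make the comparison between the patched and unpatched DAX cases easier to read, the ratio pmemdax-ntt / pmemdax can be computed per client count from the three tables above. This is only a convenience sketch over the reported tps averages; the variable and function names are assumptions.

```python
clients = [1, 16, 32, 64, 96]

# tps averages copied from the three tables above, keyed by pgbench scale.
tps = {
    500: {
        "pmemdax": [6686, 86188, 154850, 105219, 149224],
        "pmemdax-ntt": [8062, 104887, 211722, 231085, 252593],
    },
    5000: {
        "pmemdax": [5580, 62180, 92361, 112935, 117261],
        "pmemdax-ntt": [6325, 79887, 128259, 141793, 127224],
    },
    15000: {
        "pmemdax": [5164, 48230, 69822, 85740, 90452],
        "pmemdax-ntt": [5382, 62359, 80038, 83779, 80191],
    },
}


def speedups(scale):
    """Ratio of patched to unpatched tps for each client count."""
    rows = tps[scale]
    return [round(p / m, 2) for p, m in zip(rows["pmemdax-ntt"], rows["pmemdax"])]


for scale in tps:
    print(scale, dict(zip(clients, speedups(scale))))
```

A ratio above 1 means the patched build is faster. The ratios stay above 1 at scales 500 and 5000, and fall below 1 only at scale 15000 with 64 and 96 clients.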

Overall, these results seem pretty nice, I guess. Of course, this does
not say the current patch is the best way to implement this (or whether
it's correct), but it does suggest supporting PMEM might bring a sizeable
performance boost.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

pgbench-15000.png (image/png)
pgbench-5000.png (image/png)
pgbench-500.png (image/png)
#30Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Takashi Menjo (#27)
Re: [PoC] Non-volatile WAL buffer

Hi,

On 10/30/20 6:57 AM, Takashi Menjo wrote:

Hi Heikki,

I had a new look at this thread today, trying to figure out where
we are.

I'm a bit confused.

One thing we have established: mmap()ing WAL files performs worse
than the current method, if pg_wal is not on a persistent memory
device. This is because the kernel faults in existing content of
each page, even though we're overwriting everything.

Yes. In addition, after a certain page (in the sense of OS page) is
msync()ed, another page fault will occur again when something is
stored into that page.

That's unfortunate. I was hoping that mmap() would be a good option
even without persistent memory hardware. I wish we could tell the
kernel to zero the pages instead of reading them from the file.
Maybe clear the file with ftruncate() before mmapping it?

The area extended by ftruncate() appears as if it were zero-filled
[1]. Please note that it merely "appears as if." It might not be
actually zero-filled as data blocks on devices, so pre-allocating
files should improve transaction performance. At least, on Linux 5.7
and ext4, it takes more time to store into a mapped file that was just
open(O_CREAT)ed and ftruncate()d than into one that has already been
actually filled.

Does it really matter that it only appears zero-filled? I think Heikki's
point was that maybe ftruncate() would prevent the kernel from faulting
the existing page content when we're overwriting it.

Not sure I understand what the benchmark with ext4 was doing, exactly.
How was that measured? Might be interesting to have some simple
benchmarking tool to demonstrate this (I believe a small standalone tool
written in C should do the trick).
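
As a starting point for such a tool, here is a rough sketch in Python rather than C. It is not the benchmark Takashi ran: it simply times page-by-page stores into a mapped 16 MiB file that was only ftruncate()d, and into one that was pre-filled with zeros and fsync()ed. The file names, sizes, and function names are assumptions, and results on a tmpfs-backed temporary directory will not be representative of ext4.

```python
import mmap
import os
import tempfile
import time

SIZE = 16 * 1024 * 1024  # one default-sized WAL segment
PAGE = 4096


def make_sparse(path):
    """Create a file whose size is set by ftruncate() only."""
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.ftruncate(fd, SIZE)
    finally:
        os.close(fd)


def make_prefilled(path):
    """Create a file whose blocks are actually written with zeros."""
    with open(path, "wb") as f:
        f.write(bytes(SIZE))
        f.flush()
        os.fsync(f.fileno())


def timed_store(path):
    """mmap the file, overwrite every page, msync, and return seconds taken."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, SIZE) as mapped:  # shared, read/write by default
            payload = b"\xff" * PAGE
            start = time.perf_counter()
            for offset in range(0, SIZE, PAGE):
                mapped[offset:offset + PAGE] = payload
            mapped.flush()  # msync() the whole mapping
            return time.perf_counter() - start
    finally:
        os.close(fd)


def run(directory):
    sparse = os.path.join(directory, "sparse")
    filled = os.path.join(directory, "filled")
    make_sparse(sparse)
    make_prefilled(filled)
    return {"ftruncate_only": timed_store(sparse), "prefilled": timed_store(filled)}


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        print(run(tmpdir))
```

To test a particular filesystem, pass a directory on that filesystem to run() instead of the temporary directory. A single run is noisy, so repeating the measurement and dropping caches between runs would be needed before drawing conclusions.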

That should not be a problem with a real persistent memory device,
however (or when emulating it with DRAM). With DAX, the storage is
memory-mapped directly and there is no page cache, and no
pre-faulting.

Yes, with filesystem DAX, there is no page cache for file data. A
page fault still occurs but for each 2MiB DAX hugepage, so its
overhead decreases compared with 4KiB page fault. Such a DAX
hugepage fault is only applied to DAX-mapped files and is different
from a general transparent hugepage fault.

I don't follow - if there are page faults even when overwriting all the
data, I'd say it's still an issue even with 2MB DAX pages. How big is
the difference between 4kB and 2MB pages?

Not sure I understand how this is different from a general THP fault.

Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is
stored on the NVRAM. Why? I realize that this is just a Proof of
Concept, but I'm very much not interested in anything that requires
the DBA to manage a second WAL location. Did you test the mmap()
patches with persistent memory hardware? Did you compare that with
the pmem patchset, on the same hardware? If there's a meaningful
performance difference between the two, what's causing it?

Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

This patchset puts the buffers into a separate file, rather than the
existing segment files in PGDATA/pg_wal, because that reduces the
overhead of system calls such as open(), mmap(), munmap(), and
close(). It open()s and mmap()s the file "nvwal_path" once, and keeps
that file mapped while running. With the patchset that mmap()s the
segment files, on the other hand, a backend process has to munmap()
and close() the currently mapped file and open() and mmap() the new
one each time its insert location crosses a segment boundary. This
causes the performance difference between the two.

I kinda agree with Heikki here - having to manage yet another location
for WAL data is rather inconvenient. We should aim not to make the life
of DBAs unnecessarily difficult, IMO.

I wonder how significant the syscall overhead is - can you share
some numbers? I don't see any such results in this thread, so I'm not
sure if it means losing 1% or 10% throughput.

Also, maybe there are alternative ways to reduce the overhead? For
example, we can increase the size of the WAL segment, and with 1GB
segments we'd do 1/64 of syscalls. Or maybe we could do some of this
asynchronously - request a segment ahead, and let another process do the
actual work etc. so that the running process does not wait.
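
The 1/64 figure follows from the default 16MB segment size. A small
sketch of the arithmetic, assuming (for illustration only) that 64GB of
WAL is written and that each segment switch costs one
munmap()/close()/open()/mmap() round:

```python
MB = 1024 * 1024
GB = 1024 * MB


def segment_switches(wal_bytes, segment_size):
    """Number of segment boundaries crossed while writing wal_bytes of WAL."""
    return wal_bytes // segment_size


wal_written = 64 * GB  # hypothetical amount of WAL generated
default = segment_switches(wal_written, 16 * MB)  # default segment size
large = segment_switches(wal_written, 1 * GB)     # 1GB segments

print(default, large, default // large)  # prints "4096 64 64"
```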

Do I understand correctly that the patch removes "regular" WAL buffers
and instead writes the data into the non-volatile PMEM buffer, without
writing that to the WAL segments at all (unless in archiving mode)?

Firstly, I guess many (most?) instances will have to write the WAL
segments anyway because of PITR/backups, so I'm not sure we can save
much here.

But more importantly - doesn't that mean the nvwal_size value is
essentially a hard limit? With max_wal_size, it's a soft limit i.e.
we're allowed to temporarily use more WAL when needed. But with a
pre-allocated file, that's clearly not possible. So what would happen in
those cases?

Also, is it possible to change nvwal_size? I haven't tried, but I wonder
what happens with the current contents of the file.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#31Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#30)
Re: [PoC] Non-volatile WAL buffer

Hi,

On 11/23/20 3:01 AM, Tomas Vondra wrote:

Hi,

On 10/30/20 6:57 AM, Takashi Menjo wrote:

Hi Heikki,

I had a new look at this thread today, trying to figure out where
we are.

I'm a bit confused.

One thing we have established: mmap()ing WAL files performs worse
than the current method, if pg_wal is not on a persistent memory
device. This is because the kernel faults in existing content of
each page, even though we're overwriting everything.

Yes. In addition, after a certain page (in the sense of OS page) is
msync()ed, another page fault will occur again when something is
stored into that page.

That's unfortunate. I was hoping that mmap() would be a good option
even without persistent memory hardware. I wish we could tell the
kernel to zero the pages instead of reading them from the file.
Maybe clear the file with ftruncate() before mmapping it?

The area extended by ftruncate() appears as if it were zero-filled
[1]. Please note that it merely "appears as if." It might not
actually be zero-filled as data blocks on the device, so pre-allocating
files should improve transaction performance. At least on Linux 5.7
with ext4, it takes more time to store into a mapped file that was just
open(O_CREAT)ed and ftruncate()d than into one that has already been
physically filled.

Does it really matter that it only appears zero-filled? I think Heikki's
point was that maybe ftruncate() would prevent the kernel from faulting
the existing page content when we're overwriting it.

Not sure I understand what the benchmark with ext4 was doing, exactly.
How was that measured? Might be interesting to have some simple
benchmarking tool to demonstrate this (I believe a small standalone tool
written in C should do the trick).

One more thought about this - if ftruncate() is not enough to convince
the mmap() to not load existing data from the file, what about not
reusing the WAL segments at all? I haven't tried, though.

That should not be a problem with a real persistent memory device,
however (or when emulating it with DRAM). With DAX, the storage is
memory-mapped directly and there is no page cache, and no
pre-faulting.

Yes, with filesystem DAX, there is no page cache for file data. A
page fault still occurs, but once per 2MiB DAX hugepage, so the
overhead is lower than with 4KiB page faults. Such a DAX hugepage
fault applies only to DAX-mapped files and is different from a
general transparent hugepage fault.

I don't follow - if there are page faults even when overwriting all the
data, I'd say it's still an issue even with 2MB DAX pages. How big is
the difference between 4kB and 2MB pages?

Not sure I understand how this is different from a general THP fault.

Because of that, I'm baffled by what the
v4-0002-Non-volatile-WAL-buffer.patch does. If I understand it
correctly, it puts the WAL buffers in a separate file, which is
stored on the NVRAM. Why? I realize that this is just a Proof of
Concept, but I'm very much not interested in anything that requires
the DBA to manage a second WAL location. Did you test the mmap()
patches with persistent memory hardware? Did you compare that with
the pmem patchset, on the same hardware? If there's a meaningful
performance difference between the two, what's causing it?

Yes, this patchset puts the WAL buffers into the file specified by
"nvwal_path" in postgresql.conf.

This patchset puts the buffers into a separate file, rather than the
existing segment files in PGDATA/pg_wal, because that reduces the
overhead of system calls such as open(), mmap(), munmap(), and
close(). It open()s and mmap()s the file "nvwal_path" once, and keeps
that file mapped while running. With the patchset that mmap()s the
segment files, on the other hand, a backend process has to munmap()
and close() the currently mapped file and open() and mmap() the new
one each time its insert location crosses a segment boundary. This
causes the performance difference between the two.

I kinda agree with Heikki here - having to manage yet another location
for WAL data is rather inconvenient. We should aim not to make the life
of DBAs unnecessarily difficult, IMO.

I wonder how significant the syscall overhead is - can you share
some numbers? I don't see any such results in this thread, so I'm not
sure if it means losing 1% or 10% throughput.

Also, maybe there are alternative ways to reduce the overhead? For
example, we can increase the size of the WAL segment, and with 1GB
segments we'd do 1/64 of syscalls. Or maybe we could do some of this
asynchronously - request a segment ahead, and let another process do the
actual work etc. so that the running process does not wait.

Do I understand correctly that the patch removes "regular" WAL buffers
and instead writes the data into the non-volatile PMEM buffer, without
writing that to the WAL segments at all (unless in archiving mode)?

Firstly, I guess many (most?) instances will have to write the WAL
segments anyway because of PITR/backups, so I'm not sure we can save
much here.

But more importantly - doesn't that mean the nvwal_size value is
essentially a hard limit? With max_wal_size, it's a soft limit i.e.
we're allowed to temporarily use more WAL when needed. But with a
pre-allocated file, that's clearly not possible. So what would happen in
those cases?

Also, is it possible to change nvwal_size? I haven't tried, but I wonder
what happens with the current contents of the file.

I've been thinking about the current design (which essentially places
the WAL buffers on PMEM) a bit more. I wonder whether that's actually
the right design ...

The way I understand the current design is that we're essentially
switching from this architecture:

clients -> wal buffers (DRAM) -> wal segments (storage)

to this

clients -> wal buffers (PMEM)

(Assuming we don't have to write segments because of archiving.)

The first thing to consider is that PMEM is actually somewhat slower
than DRAM; the difference is roughly 100ns vs. 300ns (see [1] and [2]).
From this POV it's a bit strange that we're moving the WAL buffer to a
slower medium.

Of course, PMEM is significantly faster than other storage types (e.g.
order of magnitude faster than flash) and we're eliminating the need to
write the WAL from PMEM in some cases, and that may help.

The second thing I notice is that PMEM does not seem to handle many
clients particularly well - if you look at Figure 2 in [2], you'll see
that there's a clear drop-off in write bandwidth after only a few
clients. For DRAM there's no such issue. (The total PMEM bandwidth seems
much worse than for DRAM too.)

So I wonder if using PMEM for the WAL buffer is the right way forward.
AFAIK the WAL buffer is quite concurrent (multiple clients writing
data), which seems to contradict the PMEM vs. DRAM trade-offs.

The design I've originally expected would look more like this

clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)

i.e. mostly what we have now, but instead of writing the WAL segments
"the usual way" we'd write them using mmap/memcpy, without fsync.

I suppose that's what Heikki meant too, but I'm not sure.

regards

[1]: https://pmem.io/2019/12/19/performance.html
[2]: https://arxiv.org/pdf/1904.01614.pdf

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#32tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#31)
RE: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

So I wonder if using PMEM for the WAL buffer is the right way forward.
AFAIK the WAL buffer is quite concurrent (multiple clients writing
data), which seems to contradict the PMEM vs. DRAM trade-offs.

The design I've originally expected would look more like this

clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)

i.e. mostly what we have now, but instead of writing the WAL segments
"the usual way" we'd write them using mmap/memcpy, without fsync.

I suppose that's what Heikki meant too, but I'm not sure.

SQL Server probably does so. Please see the following page and the links in "Next steps" section. I'm saying "probably" because the document doesn't clearly state whether SQL Server memcpys data from DRAM log cache to non-volatile log cache only for transaction commits or for all log cache writes. I presume the former.

Add persisted log buffer to a database
https://docs.microsoft.com/en-us/sql/relational-databases/databases/add-persisted-log-buffer?view=sql-server-ver15
--------------------------------------------------
With non-volatile, tail of the log storage the pattern is

memcpy to LC
memcpy to NV LC
Set status
Return control to caller (commit is now valid)
...

With this new functionality, we use a region of memory which is mapped to a file on a DAX volume to hold that buffer. Since the memory hosted by the DAX volume is already persistent, we have no need to perform a separate flush, and can immediately continue with processing the next operation. Data is flushed from this buffer to more traditional storage in the background.
--------------------------------------------------

Regards
Takayuki Tsunakawa

#33Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: tsunakawa.takay@fujitsu.com (#32)
Re: [PoC] Non-volatile WAL buffer

On 11/24/20 7:34 AM, tsunakawa.takay@fujitsu.com wrote:

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

So I wonder if using PMEM for the WAL buffer is the right way forward.
AFAIK the WAL buffer is quite concurrent (multiple clients writing
data), which seems to contradict the PMEM vs. DRAM trade-offs.

The design I've originally expected would look more like this

clients -> wal buffers (DRAM) -> wal segments (PMEM DAX)

i.e. mostly what we have now, but instead of writing the WAL segments
"the usual way" we'd write them using mmap/memcpy, without fsync.

I suppose that's what Heikki meant too, but I'm not sure.

SQL Server probably does so. Please see the following page and the links in "Next steps" section. I'm saying "probably" because the document doesn't clearly state whether SQL Server memcpys data from DRAM log cache to non-volatile log cache only for transaction commits or for all log cache writes. I presume the former.

Add persisted log buffer to a database
https://docs.microsoft.com/en-us/sql/relational-databases/databases/add-persisted-log-buffer?view=sql-server-ver15
--------------------------------------------------
With non-volatile, tail of the log storage the pattern is

memcpy to LC
memcpy to NV LC
Set status
Return control to caller (commit is now valid)
...

With this new functionality, we use a region of memory which is mapped to a file on a DAX volume to hold that buffer. Since the memory hosted by the DAX volume is already persistent, we have no need to perform a separate flush, and can immediately continue with processing the next operation. Data is flushed from this buffer to more traditional storage in the background.
--------------------------------------------------

Interesting, thanks for the link. If I understand [1] correctly, they
essentially do this:

clients -> buffers (DRAM) -> buffers (PMEM) -> wal (storage)

that is, they insert the PMEM buffer between the LC (in DRAM) and
traditional (non-PMEM) storage, so that a commit does not need to do any
fsyncs etc.

It seems to imply the memcpy between DRAM and PMEM happens right when
writing the WAL, but I guess that's not strictly required - we might
just as well do that in the background, I think.

It's interesting that they only place the tail of the log on PMEM, i.e.
the PMEM buffer has limited size, and the rest of the log is not on
PMEM. It's a bit as if we inserted a PMEM buffer between our wal buffers
and the WAL segments, and kept the WAL segments on regular storage. That
could work, but I'd bet they did that because at that time the NV
devices were much smaller, and placing the whole log on PMEM was not
quite possible. So it might be unnecessarily complicated, considering
the PMEM device capacity is much higher now.

So I'd suggest we simply try this:

clients -> buffers (DRAM) -> wal segments (PMEM)

I plan to do some hacking and maybe hack together some simple tools to
benchmark various approaches.

regards

[1]: https://docs.microsoft.com/en-us/archive/blogs/bobsql/how-it-works-it-just-runs-faster-non-volatile-memory-sql-server-tail-of-log-caching-on-nvdimm

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#34tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#33)
RE: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

It's interesting that they only place the tail of the log on PMEM, i.e.
the PMEM buffer has limited size, and the rest of the log is not on
PMEM. It's a bit as if we inserted a PMEM buffer between our wal buffers
and the WAL segments, and kept the WAL segments on regular storage. That
could work, but I'd bet they did that because at that time the NV
devices were much smaller, and placing the whole log on PMEM was not
quite possible. So it might be unnecessarily complicated, considering
the PMEM device capacity is much higher now.

So I'd suggest we simply try this:

clients -> buffers (DRAM) -> wal segments (PMEM)

I plan to do some hacking and maybe hack together some simple tools to
benchmark various approaches.

I'm in favor of your approach. Yes, Intel PMEM modules were available in 128/256/512 GB capacities when I checked last year. That's more than enough to hold all WAL segments, so a small PMEM WAL buffer is not necessary. I'm excited to see Postgres gain more power.

Regards
Takayuki Tsunakawa

#35Ashwin Agrawal
ashwinstar@gmail.com
In reply to: Tomas Vondra (#29)
Re: [PoC] Non-volatile WAL buffer

On Sun, Nov 22, 2020 at 5:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
with WAL on PMEM DAX device) is actually safe, but I included it anyway
to see what the difference is.

I am curious to learn more about this aspect. Kernels have provided
support for "pmemdax" mode, so which part of the stack is unsafe?

Reading the numbers, it seems that modified PostgreSQL gives a clear
benefit over unmodified PostgreSQL with "pmemdax" only at the smaller
scale. In most other cases the numbers are pretty close between the two
setups, so I'm curious: why modify PostgreSQL at all if unmodified
PostgreSQL can provide a similar benefit with just DAX mode?

#36Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: tsunakawa.takay@fujitsu.com (#34)
Re: [PoC] Non-volatile WAL buffer

On 11/25/20 1:27 AM, tsunakawa.takay@fujitsu.com wrote:

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

It's interesting that they only place the tail of the log on PMEM,
i.e. the PMEM buffer has limited size, and the rest of the log is
not on PMEM. It's a bit as if we inserted a PMEM buffer between our
wal buffers and the WAL segments, and kept the WAL segments on
regular storage. That could work, but I'd bet they did that because
at that time the NV devices were much smaller, and placing the
whole log on PMEM was not quite possible. So it might be
unnecessarily complicated, considering the PMEM device capacity is
much higher now.

So I'd suggest we simply try this:

clients -> buffers (DRAM) -> wal segments (PMEM)

I plan to do some hacking and maybe hack together some simple tools
to benchmark various approaches.

I'm in favor of your approach. Yes, Intel PMEM modules were available
in 128/256/512 GB capacities when I checked last year. That's more
than enough to hold all WAL segments, so a small PMEM WAL buffer is
not necessary. I'm excited to see Postgres gain more power.

Cool. FWIW I'm not 100% sure it's the right approach, but I think it's
worth testing. In the worst case we'll discover that this architecture
does not allow fully leveraging PMEM benefits, or maybe it won't work
for some other reason and the approach proposed here will work better.
Let's play a bit and we'll see.

I have hacked a very simple patch doing this (essentially replacing
open/write/close calls in xlog.c with pmem calls). It's a bit rough but
seems good enough for testing/experimenting. I'll polish it a bit, do
some benchmarks, and share some numbers in a day or two.

regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#37Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Ashwin Agrawal (#35)
Re: [PoC] Non-volatile WAL buffer

On 11/25/20 2:10 AM, Ashwin Agrawal wrote:

On Sun, Nov 22, 2020 at 5:23 PM Tomas Vondra <tomas.vondra@enterprisedb.com>
wrote:

I'm not entirely sure whether the "pmemdax" (i.e. unpatched instance
with WAL on PMEM DAX device) is actually safe, but I included it anyway
to see what the difference is.

I am curious to learn more about this aspect. Kernels have provided
support for "pmemdax" mode, so which part of the stack is unsafe?

I do admit I'm not 100% certain about this, so I err on the side of
caution. While discussing this with Steve Shaw, he suggested that
applications may get broken because DAX devices don't behave like block
devices in some respects (atomicity, addressability, ...).

Reading the numbers, it seems that modified PostgreSQL gives a clear
benefit over unmodified PostgreSQL with "pmemdax" only at the smaller
scale. In most other cases the numbers are pretty close between the two
setups, so I'm curious: why modify PostgreSQL at all if unmodified
PostgreSQL can provide a similar benefit with just DAX mode?

That's a valid question, but I wouldn't say the ~20% difference on the
medium scale is negligible. And it's possible that for the larger scales
the primary bottleneck is the storage used for data directory, not WAL
(notice that nvme is missing for the large scale).

Of course, it's faster than flash storage but the PMEM costs more too,
and when you pay $$$ for hardware you probably want to get as much
benefit from it as possible.

[1]: https://ark.intel.com/content/www/us/en/ark/products/203879/intel-optane-persistent-memory-200-series-128gb-pmem-module.html

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#38Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#36)
2 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi,

Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.
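
To illustrate the substitution in the abstract, here is a small Python
analogue (not the patch itself). One function takes the conventional
open/write/fsync/close path; the other maps the file, copies into the
mapping, and flushes the touched range. Python's mmap.flush() (msync)
merely stands in for a libpmem-style persist call, and SEGMENT_SIZE,
the record contents, and all function names are hypothetical.

```python
import mmap
import os
import tempfile

SEGMENT_SIZE = 1024 * 1024  # hypothetical small "segment" for the demo


def write_fsync(path, offset, data):
    """Conventional path: open / pwrite / fsync / close."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.pwrite(fd, data, offset)
        os.fsync(fd)
    finally:
        os.close(fd)


def map_copy_persist(path, offset, data):
    """Mapped path: map / copy into the mapping / flush the range / unmap."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, SEGMENT_SIZE) as m:
            m[offset:offset + len(data)] = data
            # flush() needs a page-aligned start, so round the offset down
            start = offset - offset % mmap.PAGESIZE
            m.flush(start, offset + len(data) - start)
    finally:
        os.close(fd)


def preallocate(path):
    """Create a zero-filled (possibly sparse) file of SEGMENT_SIZE bytes."""
    with open(path, "wb") as f:
        f.truncate(SEGMENT_SIZE)


def demo():
    """Apply the same record via both paths and compare the resulting files."""
    record = b"hypothetical WAL record"
    with tempfile.TemporaryDirectory() as d:
        a, b = os.path.join(d, "a"), os.path.join(d, "b")
        preallocate(a)
        preallocate(b)
        write_fsync(a, 8192, record)
        map_copy_persist(b, 8192, record)
        with open(a, "rb") as fa, open(b, "rb") as fb:
            return fa.read() == fb.read()


if __name__ == "__main__":
    print("identical contents:", demo())
```

Both paths leave the same bytes in the file; what differs is how the
data gets there and what it costs, which is what the measurements below
are about.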

The numbers (5-minute pgbench runs on scale 500) look like this:

  clients   master/btt   master/dax       ntt    simple
  -----------------------------------------------------
        1         5469         7402      7977      6746
       16        48222        80869    107025     82343
       32        73974       158189    214718    158348
       64        85921       154540    225715    164248
       96       150602       221159    237008    217253

A chart illustrating these results is attached. The "master/btt" and
"master/dax" columns show unpatched master with WAL on a pmem device, in
BTT or DAX mode respectively; "ntt" is the patch submitted to this
thread, and "simple" is the patch I've hacked together.

As expected, the BTT case performs poorly (compared to the rest).

The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch
outperforms both of them by ~20-40% in some cases.
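
For reference, the relative gains can be recomputed directly from the
table. The sketch below only restates the numbers above; the dictionary
layout and the gain() helper are illustrative names, not part of any
benchmark tooling.

```python
# Throughput figures copied from the table above, keyed by client count.
RESULTS = {
    1:  {"master/btt": 5469,   "master/dax": 7402,   "ntt": 7977,   "simple": 6746},
    16: {"master/btt": 48222,  "master/dax": 80869,  "ntt": 107025, "simple": 82343},
    32: {"master/btt": 73974,  "master/dax": 158189, "ntt": 214718, "simple": 158348},
    64: {"master/btt": 85921,  "master/dax": 154540, "ntt": 225715, "simple": 164248},
    96: {"master/btt": 150602, "master/dax": 221159, "ntt": 237008, "simple": 217253},
}


def gain(clients, patched, baseline):
    """Relative throughput gain of `patched` over `baseline`, as a fraction."""
    row = RESULTS[clients]
    return row[patched] / row[baseline] - 1.0


if __name__ == "__main__":
    for clients in sorted(RESULTS):
        vs_dax = gain(clients, "ntt", "master/dax")
        vs_simple = gain(clients, "ntt", "simple")
        print(f"{clients:3d} clients: ntt vs master/dax {vs_dax:+.1%}, "
              f"ntt vs simple {vs_simple:+.1%}")
```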

The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.

So I've added some primitive instrumentation to the code, counting the
calls and measuring duration for each of the PMEM operations, and
printing the stats regularly into log (after ~1M ops).

Typical results from a run with a single client look like this (slightly
formatted/wrapped for e-mail):

PMEM STATS
COUNT total 1000000 map 30 unmap 20
memcpy 510210 persist 489740
TIME total 0 map 931080 unmap 188750
memcpy 4938866752 persist 187846686
LENGTH memcpy 4337647616 persist 329824672

This shows that nearly all of the 1M calls are memcpy/persist; the rest
is negligible, both in terms of number of calls and duration. The time
values are in nanoseconds, BTW.

So for example we did 30 map_file calls, taking ~0.9ms in total, and the
unmap calls took even less time. So the direct impact of map/unmap calls
is rather negligible, I think.

The dominant part is clearly the memcpy (~5s), followed by persist
(~0.2s). It's not much per call, but overall it costs much more than
the map and unmap calls.

Finally, let's look at the LENGTH, which is a sum of the ranges either
copied to PMEM (memcpy) or fsynced (persist). Those are in bytes, and
the memcpy value is way higher than the persist one. In this particular
case, it's something like 4.3GB vs. 330MB, so an order of magnitude.
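
The counters are easier to read once converted to totals in seconds and
per-call averages. A short sketch of that conversion, using only the
values from the PMEM STATS output above (the dictionary layout and the
summarize() helper are illustrative):

```python
# Counters copied from the PMEM STATS output above (times in ns, lengths in bytes).
STATS = {
    "map":     {"count": 30,     "time_ns": 931080},
    "unmap":   {"count": 20,     "time_ns": 188750},
    "memcpy":  {"count": 510210, "time_ns": 4938866752, "bytes": 4337647616},
    "persist": {"count": 489740, "time_ns": 187846686,  "bytes": 329824672},
}


def summarize(stats):
    """Derive total seconds, ns per call, and (where known) bytes per call."""
    out = {}
    for op, s in stats.items():
        row = {
            "total_s": s["time_ns"] / 1e9,
            "ns_per_call": s["time_ns"] / s["count"],
        }
        if "bytes" in s:
            row["bytes_per_call"] = s["bytes"] / s["count"]
        out[op] = row
    return out


if __name__ == "__main__":
    for op, row in summarize(STATS).items():
        extra = ""
        if "bytes_per_call" in row:
            extra = f"  {row['bytes_per_call']:9.1f} bytes/call"
        print(f"{op:8s} {row['total_s']:8.4f} s  {row['ns_per_call']:9.1f} ns/call{extra}")
```

With these counters the average memcpy length works out to roughly
8.5kB per call, while the average persist length is well under 1kB.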

It's entirely possible this is a bug/measurement error in the patch. I'm
not all that familiar with the XLOG stuff, so maybe I did some silly
mistake somewhere.

But I think it might also be explained by the fact that XLogWrite()
always writes the WAL in multiples of 8kB pages. That is perfectly
reasonable for regular block-oriented storage, but pmem/dax is exactly
about not having to do that - PMEM is byte-addressable. And with
pgbench, the individual WAL records are tiny, so having to write/flush
the whole 8kB page (or more of them) repeatedly, as we append the WAL
records, seems a bit wasteful. So I wonder if this is why the trivial
patch does not show any benefits.
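
A toy model of this hypothesis, not a description of what XLogWrite()
actually does: assume every record is flushed right after it is
appended, and compare copying only the record's bytes with copying
every whole 8kB page the record touches. The 100-byte record size is a
made-up stand-in for "tiny", and page headers and group commit are
ignored.

```python
PAGE_SIZE = 8192  # WAL page size assumed by this model


def bytes_copied(record_sizes, page_granular):
    """Total bytes copied when each record is flushed right after being appended."""
    pos = 0
    total = 0
    for size in record_sizes:
        start, end = pos, pos + size
        if page_granular:
            # Copy every whole page that the record overlaps.
            first = start - start % PAGE_SIZE
            last = -(-end // PAGE_SIZE) * PAGE_SIZE  # round end up to a page
            total += last - first
        else:
            # Byte-addressable case: copy just the record.
            total += size
        pos = end
    return total


if __name__ == "__main__":
    records = [100] * 1000  # hypothetical tiny records
    exact = bytes_copied(records, page_granular=False)
    paged = bytes_copied(records, page_granular=True)
    print(exact, paged, f"{paged / exact:.1f}x")
```

In this model the page-granular variant copies each page many times
over, once for every record appended to it, which is the kind of gap
between memcpy and persist lengths discussed above.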

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

patches.png (image/png)
.�K,���	�������H�Y8�;����V�PH���f h4���4�	755%<�������>��Yd���V�0)�677�=��U8�����J555����)�������q:��7@&�������UPP���,�B����p8�.�Y===�F������r�����Y6�g	�\]]����$�&���F��8c�s�����<��3r] ��A30t�\V���;w�hppP���9`��<���"��r?s���[]�WSSc�����S
�����D%�s���x�Y���F---r:�jnnVmm�����l����1>�O������-�1��7[�@<�,46,lmmUuu���a���O8/
���E�TYYu���G.�5o������ �a���|fv����g�UVV���D�PH;w��
�B��***
���x����Z�'������������r���$�a��	�K�3�Z[[%�v�:r��***���U\\,��#��)��o����
G����k��	
�B�5�����Y����i���x������Zy<��~�|>������V���r:�Q�������[���f�(���;���3��<D�E"���E��`0��������r��������>-[�L.���r��e��uii�����.�y��3��C&C&C&C&C&C&C&C&C&C&C&C&C�|��M��d���m���{�!E>����!E��d+����NX]��0���{����@�c�~�2Cd��po8�H_L����I��d.`�$
�b��|�
���f�������IIy��\��d+|X*,�7����s@���}��c��w�V���9b�.��u�[�wZ]���V����o�j�1V\����{�%Y5�o6��d���I��������_��������`�����M:��>=�/A�;������po�&st��d�t���}���d���t,��$�w�X#��+����o_2M:��) 0�.�&I�����W���7:
&k���a��h�4�`�>Ii��G�dCr�dM:��oTZ����s�!d�}��c��a���!9����mL���&@r�&i��7�th�}�&j����%I���*(*��V��C`����}i6��5q��M���}���B`�	���G��Q���M��C���5��M�oM:$����}����0%����M:4G���w��wc��o�C���j�1W����I�������	����}�M:���E`�t�>1�����}�}�6CHQ��{�Dy�>�w�X�y�F��}�����&`-CsG�%��A^�x��qE
[}��l�}�h����@VHj����������4����������N��Hb/a`��P�����s�����d���2�!�����D3�R��/��n��1A^2�x����|4@---UQQ������(����-.�+|8z_�%���0�O��<�:��K��!0��j��~}��_ �
_���[�4�S�"��ZX�Ov}���&D`��d7�������f��_`�������4x��F���
��G`��I���9�J�9G�����? I�������O�:�}�D�2�!���~}4��������}��~
������eE�,CLI�%�4��P������a��	���i�t�+�W���^l��?�	���%�4��X:������0x�������[��>'o�K7��^��.���F`��������������*��@#��_�d�����a�"4:'��p�p���!a��mIR���V���,0is�i�������m�]�o���Ds0=&�p<����/x5��.n�qG(����s=����|�������6'D`��$�s�__�� �� ����u,��P�
6n�<�:�������sW�����Uh�u[}��"0sF��]���X���-�e�>�%��p����:!a���9��sW�G��l�D|������I���d7ag^��{>�������ah�OD=����\3�����uU��O�3��sW��ff;C`��'?2C��%�4���9G����tu���O��-������,a�$L%$4:���C��_�w����o����=o]M�����>�����X�rL��X"<�p�^���v�6�`�>�Y1x���L�:!a�����y:���s�u�j{Z5���'�Z����$i�`�E��22,���7�~}"��$F�����%��������SG(�s]�����"\�,��e�Z�t��lE`fU��M#m[���I4��[�Ks��0r%��o����)u8��m�>|�a�����f%����Ij�]4o��[Z�U�'B`f�pH�v�7"�%����#����/��eK�V��������������,��=�5���N��`v�4��%������s�`�&dt86�&LV��R��?��������4��t%��u-Z�uK���Y��=��y�~)f�!�#�*��7���5�B$d��Cm���p�������e�^�B��KI��.J7�^s������E8C0���{��Gq����=�%��@�I������w�W����_>�@��|v��I?�k�R������sb�D��
_zU����q��dwUY]2���N
�u,�������k��j�B����1G�,# �>�D�4+�.�`����X7nwU������`������u8�U`���������+�g�=��2����1>C0#"�'��7ns�e/;`uy������[�&	/>0OgJ�}�|]|`^��S4o�����3�0�`�E��4�����m���'�.�,��o.7����1=���e���z�5_�
��E���'�d���B`��pH#m[��P�x�s4,�wZ]!fA2�����"t-Z�uKG;o~��9��d���CX0����u,�X�#_gJ����a.3^V�,�B`���_V��-n<��?�����<�#$��p|��n�<S�@=���(���Q�1���w�U�7~�����l���.�l�����8UF�# �Y��#0S�X�p�`��}�����.�d�JPw�:�*��L�"]����0�Y�F�c�1�E`�$<���W����*�W5X]�h�JP��OFu8~�5_�nI��J��-4;�[Z�,�C`���G
w���=�%��X]������?n~@����V�=�k�������%�����g��m����fuyH�����!��������E�Z��^G���*�����C�R���1ZZZTQQ!��&���GyD555
��
�TWW�GyD6�M������Q(��s��<@�"j�m�4�k+|Xy�R�������'���/��n������?�������^�����zQJa�����W�[?��W��m�U��F��� ,�2�H$���&�����N��NgTPx��UVVNx������7����Y�g��d�yf_0TOO����TZZju9H�pH#m[�o��w*�}B6���
1FWW�����l�2�\.��r�������***�� ��37u�:u��	�������������h��=av4FN�1�0�=�Z[[������nuww�!a��u;w�T ���Uoo���?���n3\����{�t����)/,�2���N���^������7�����/�����EI��E�����W�[?��_?��_��w�ff����k���t���[Ng���GyD�@@���jll�$�|>UTT$</
i���
jmm���M��l:�5�a����{�����UeuyH����`�0���K��������a�u���O��������9��������F��&
%���Pi4T3����t:U]]u�t�����7,\�@X�A�~xV����|p�*�}I���u|�p�a������H���������������&,�C���e�����z'llb��aII�����~c���9�t y����a��J��/X]���uUg/���:��������p$��Ei�Cm~�I�q�$"r�a�)))733��>nl3�D�qc�^��d�y������7nwU�^v����$����3���]���O���������p��=���~��ni9]���0�455���W�|v]������M�&�oS�b}�����}U�����[W�����{i]�3X��?�����Y���aiiiQ]]�$���v������������s��<��"�mi�"
���m���'�|�B`��
������?�[WS���[a=�_���(��X�����sDKK�jjj$��,4�#c�]'����/�92K^�_����
G����C���~�������G===V�����.�K��\����`�:�|����q����_4�c��Zy�����5-X�5�����#�{�������yJ�Kg^�����<�K����Q�#6�������. '�	���V�~q����:u'|7�k��6���w�w�+Z�b��lp+\���[C�#0�ruuujjj�4��0
��9�t�d\.��.]:#�r����������rY](��g��	%id����W�huyHQ0��;wT\\������r��e�rr�\Z��_�����?sm��Z�k����7?J��E�am���V����Y���~S��n�m�r�o
IZ�0��f������H�U[[;�c�^�|>��
B���c�=L��l:o���% S���K����3~�\���p��/n�^v@\UV��4\�:����y���f���YfQ6��o]����o�v��x,xW��
iC����}�����5�z�������Y��N��:"'b�e��Y`��]���9�t Z���
���W� ;a!@��n��z��J}��e��z,xW?xW��jq�rl����U�j����#0�Bc�������0�1G�Ucc�������LE��;�s��<�=��A�/����]U��j��<���
�\O�:C��}�^Z���c�G����l�
��5�w��`�f�os�-�D�.���|����$������]�V~�_�G���r:�
�B���SKK�JJJ���=�s��<�/���GEEE*--��H���i����q����ug�.S������>-[��}C�d�j����@.��g�o�v�2����)_cY��n�1�6nV��������Y��-'����z���j~}��3X+..���Q P(�����#G����9�t�u��6��m��9��s���<���
������7�sW�S>�`8j����a�X�����^��/xes8��U@v�@j�=����,���j9�N�od���u����K��9'���9m84�t��w���a)���ur@��m�nU��fm>�O���xQ���qJa��kC��y����':���z��5=�yK������L����c'U���5��Bd�$fK�3�gaa��-z<��<�	�n�+�4aI20;�^	�53�3g.3N�Y����6|xG��vH>��E����y�]��m�
6n��12K���F>�r|X()o�k��`����a�����a4)1�#����e�KH�cdC��p�^EzO���������.`F�>~��f�n���{��"��Y�c��7!�m�C�����|AvW���L��P@�z���0��Ac6a"FH8�����z����A�/�7nwU�������E����e���"4B�
�;>���E�����
w���o������LIG(������@J]�
��mz���e����'||��2n�AH��E`@����i��/���n�=��V���;���������]���������[�_���5C����.3�%��1r�!9,2��F��H����N��OH�N�KH�����]�@��;��x�������;	���2:�����s
�!�j8���@V
���a:�c�J��dPO�{A�.�H���a.7�_]f��X���5��E�����<�	�n������=��q:�J\���7�N��;z���*�������1������0,�� ,�#���v3$L���'�w�����_j���4r�&}���x��V�@F!0 ��/�+<7n_� �������7t;j�q:��8K�9=��?]�t-O:����)II�|#$��10>CrH8xP�K����]U��|�����
������7�sW�S>�h�B�[Z�uK�������'����i8�k��H
�!9"<�p���q��_�������9�hVb��4+Y�,��e�fP8x����O�����%-H�t8�G`@���)|�>n��p+o�kV�r����g4(LU�������Z�,����T���<}RC'����>�HR�1BB:SC`@����i�m�4��9��s����V�r�1���'�Y����u����Y"I�����']{�X�!�����������C
w�����}��B0m|�gv3N�Y�k��{!��r�[(I����������L�z���������l���/
�s�V�!��mQ��-z<��<�	�n�+Y�#���{���}�	�9���'�Z������oh��>��q2�Y��P�p��/-��er�!Y*|�>>,�d_�@XR�7t;j�q:��f%FH8V��_�?�i��c)��F���m;ds8$Ia�����i�d�p�^�����edwUY]�F'��4+��x�,B�^H8�vN��O&}]#$,������2�K�&]U��`BF�#(L�Y�1��	<}r4$����@"0 ����X7nwU�^v���@2�Oa�����g�6?����$���N
�u,����p\�m!!�A���6�;����n�W5X]��[W�������5��\f��\k�%�>n�B��K����s�S���V�\ 0 D��4��%n��p+�}B�wZ]"���>~/�f%�EK��"o�4��x��I�}�X�!�t��q��f�����L7R��]�p(z<�)���	��:B�����8���0��6+���7������EHHdC2�pH#m[�0z<��<�	�
��B0��n����j{Z��f%FH8:s�!l��YE�������&��muy`�����d�2��f��w�:��s��o�AH�C2T�s�"��(n�^v@��duy`�o]����fP�N�c�&�	S�pl��_����1�d���z�����+_��Ueuy`������Eh4+Y�,������YI,:�!&<���W����*�W5X]�����fG�sW���Eh.3^6�b�JPw�:�RH(��IHH�
�d�H�I�;�����7�^v���@�����{i5+q-Zjv4^��<�Y��hH8x�d��6n�IH�c`�!0 CD��4�����m���s��I�t��^G�t���'�nV2������0Yt8 ��Ci�"
������s����VW��7t��^�W���E�n��Xw�:6��������[���m;	��a!Y��d�n���y��E��<�Y�:�n�Xl�m�"�mq�y���l����6+;�0�f%�,��=�r��XF�����%���	`<�X(��7aXh/; [1]���9���h��-�d���'SnV�	S�p<����%t8�$C,�X�p�`��}�����.�9'x�jTG�t�����Oy�D�c� 0���A�/�7nwU��������f%F@�N����fG�uK��<�P���x��c)�����T�m!!�)#0`�E>����{��m~I��V�@N�t���^��0B�COL�Y�XFH�j����e���t80]�E��6�t<7ns����5�� ��
�����
���5��hVb,7�NF���oK�:�i���H�F��H���q[���s����V�@N�t��w�nVR4o�����O�,B�1�p����:��		�4Cf�pH���qa����?z���)��z����v�
�N��,��=9-�Jb�����p��^:�u�����F��(��w(�}B6���
�:f@��,B���Z�t��������f%���w�:�tH(��p<��}
`<�������a���a!I
��*������a.3^V>#��{!��O})u86f��@& 0`�;�*<7n/; ������X}C�u�j������h���x����E(��@�!0`������*�B�t���^��0B�COL{��X�L�TBB����m;	d,Cf@8xP���q�vW��e�.���7t['?~w4(����,B�Y���x�
�>9:�0��FHH�c����i�oS�b}�����}U���`��P@'?~7�f%E�����--��Y���=�L�H�F��H���q���<�	)�iu�������:_mW�����a�"4���0r%��oK)$����v�z�L���������}��B��a�i�"t-Zz���6+��n�����GCB:��L���F��(2�a�x�Sy��9�VW��	��*������a���gm�t����7a�����v�9�L���g�o��[�a! ��
��Zf�N����Q�g���K����m;T�q3��4C�(��W�O~7n/; ��_��<�EG(�s=�B�T�[�uK���p6�����1$���)_zU����q��dwUY]i�����kv5�J�c���/tj��c)��F���m;	�I��)<�����q��J�U
V�@�����h�B��������"�FCBc_�d;!!�����DzO*��7n��p�^v���HJ������2��
�B�c�>��(�����7ns���>auy�5��������1�?t+�Y��EK�u4^Z��y-{M"���r�TB���e*�����A`@*�Ci�"
������?zX�wZ]!`�
�Vg�{��u�S��"g��*(���1��]�t��3 \Vn�,B)�����T�/� 	�$k��0�}B�����r�d��d����tg��	M����h��=av4�F���oK�����m;	 �$i��/+��7������p[]�����
��:���	����y�ni�Z��$�����XnL�cHOV�~�_�����������(��W���eO����I���,�K&������W'<���Z�T����k�K������`��������f���H���d���z�����+_��UeuyfIG(���[�'�C"�v�EK'l�Z�t��~��#W������BBi����/!`�de`�l	*|���q��J�U
V��)�v��:q��7t{�����	�#;�q��1o����+��������RY}�Ya�JP��O����`�����t8�����t������Y�����G
w���o�����������,�[S<q��2�#z�Gf��1d���)��6�t<7ns�����[]��sW�S>�������]��]�L���Nv�w�:�rHH�c�NV��'+�	�N��5��E�n�e+|Xy�R���a�t��&����S23������IL�Mv~&5��J���������accc���B!�B!��$��t�	�A�!�?�*�)���	-�7t[�����Kv�C���8�Cv"�f���x��cI��F�ccoB���20�HKK�$���Z�hPXSS��G�J�JJJt��fi��H[�x���l�����?��W�q��$�_�����/��E�������'j(��o����c�"�qN��Y�v8��m�h�:@����0
���B~�_^�������P���***t���qg��p���a���a�4�����+���R������y��\�R��W������v�A�c�m9��������t���J
��F@h�6liia2����p�`��}U���*���)��j�����
Z]��X�?��}�_4�x������c0�U��������w�����'�7�@���>�}�K������t9�Yy�]�/wi��kI�ck�]��oK)$4:n�AHY"gC��'I:r��3�^�����YG�5@8x0qX���}�V��S������_�@��#���3�-���F��e����?|Ci_�Z�����?�=��!E�3B�T;!a�����@�������%���KR����TII�y0�����7n{�K��������S���������!pg��m���	�L�0���� ��1�n���kR��u_� 7�����	%��Ih�"`���)|�>n��p+o�kV��3��n������HO��UA������B�H�>��H��#��N�I)-�����������N:�v�������L�EE)������8ca*!����p�BB�!9��}>���JJJ�:"s0wE��4��EE��n��OH�N�K�	���N�[��>wl�`X�yL_��L�z6�C���R�#S;p�F�
iphP����{�#E��y��b��g�/E�V���&?'C�����/��FHH�c�]9VVV�������Smm���d���P(��;w�������{��B�;e_s��p�����'�����������6�?�\�J���V�i��@�����>�_�L�]3��X&���e��9��F�8"���71�Dh�?F����vX}[�f�D"������|����omm������WEE�B���N��59W��������������l3��k�B!555����
r:����Tcc����������`0������t�i�m�"�m���N��O��p[��s������;�P�5�Y]����Y`�l�2�f80�5����t���W8�s@1[�Y�_�����KPZZ����f��/gfz�^9rDuuu�����������DG�������EMMMI=v���d_QQa���x������"�������%��v��g�����P�}Ua�4��������]Nx���S������Z�u@&E��p������|9��������Q(4��n��6�M�����Th�v�Z��~3�u:�
��s��~�<���?����}V�0w�U8x0n�^v@vW��/M���G������c�������6�n=a�,c�!r�l�#W�I5����$-(\���]I7
�F����v"+0��Q�9�c�m����Cc����Too���1�����Q�zC����]�@ ���rkX�/���#��U%{��_����������~�H��U�������U�~��2�$C`vL5�����������xNlj�&��1�
�!0�rgI�!��P}>�9��x��zU[[�����|>�����okkk�Z�<v�n��#�6��}��N����UWW��G��x�p��#<HX8��������%{������'�����Z]&d�t�t�&���/�9ZZZ��#����)*,�F�����I�<���A9[�|>y�^������1��R
���o��q��}��<sC��M���q�6���p���n�����hX��?����W	�,�33�����(^�7*���:z���~�jjj������������N�ZZZ����}Y]]5Co����
g��<�/�����-q�6�[y�V���:B���Q���b�`X��h���m�+�����%69���.eB9���I����U]]w�X�������555em`�n�F`h�V��'�����&9r�"'��7���l9@�)��.i8=������R>�
��}�������6wl��!�a����V�	d�+7":�1�������Oh�c�����dc�neee��p���jy<���9�D5���KJJt��E"������Y%%%
�B��s���Tad�� ��4��E�������s����a�+�Zz������N>�=������3�B�mWnDt��O������;z��!]�	K��>���g^f��)o�X^���='���*
�����ug4���4�	755%�/�\v��
O�B�344$IV__j��U�7���o�\�m�DVI3�����������u��3����|@�S�-�Y=c�W����jhh��
0���cu	����2��>�\����6u2����������PN�s)�KWII��!���cSS��=�r`�l#�l=/�`0�/^���sG]]]�~�w�`�t���u�~����������=�/���G����E�a��s�������M���)�Goo�z{{�~!
�V��)���j�����Cg?^��}��.g��L`��x�������	C�P(�������z�jjjJz�v:K���<�g��~�?����O
~W������e%��N��_�7���/�XumH�������O��P`�!a���j�I��_q��*_���������D`(I������PEEET���|>��I������3���5��L��0v�g��7U����z=`�����QQQ��������q�vW�~���~���Bz������	w<������������eb]]]�����e��r��.�Y��s��������r����f[���Nw������%�Ir��i��<m{�>�^�P��o%)9655���D�@@r:�Q3
�@T��v�����D"V������QKK��^�Z[[>���I���x�B�����r���=�p���q[�f��X]^�������c����Jx������[�y�uV�
��������i�����������
����2#,����T[�*++��1�44�����������H�F>�r����V��nuyY�#P�_�_�����c���Fw�����W��2�Kf�����B���MnbY.��c�"92�n���l
�l���;���	�!
�]�=�Z[[�`-
���B~�_�G���7�[�v�9n�
�TWW���������;����<�o:�$G>�����p�^��N������3o�L���=5������[umH���*y~�[�9����cI20;X	�.~�0�	��r�n�5��-N�SG�1����by<9�N��~�B!3l���#fg�
��k&�-��b�!�?��0,�s� ,L�w���?NxlC��^|�7���oX]&0m�		%i����B�<9�����a��%2���w�Xuuujjj���Y��x������Zy<��~�|>������6j�����W]]m���T]]������/`�� {��mQ��-n<��?��������7t[u?m7,������=Z���au���]�	����o��;��c�z�������k�������X���@�����P��%�6�m����u:������l�s�cIr�s����q�����������uUu�����~���������0��$�,�f?sH����Nw�t����`&a�sIrl�cC2�N���PH���X�8,\�@X��sW�Uw�q��
��E�~����~��J�����������bH%+CI�������	�a�*d�p����^����d_����e��w�X�m?�������O�W�~��2��L%$��&_��k����������^��z<577[}�D>����{��m~I��V����n�����c�8��x������W�����T )WnD���BO8��	'����$566�����:�=�%�����g��m����fuyY!x���N7����q���
����y�uV�
L(��p���?v!a�60�U[[�����L�H�F��H��{��
V�����L��s�������+���;f�W�y�O�����LS		�������������a]]��~�l���
����
���*�)���	�0�~��2��~����&�8�w#:�1������p�ee`���'���	�L�pH#m[�o��w*�}B6���
3Z��m}����u'�������������}V�
��������N�<B�����! ��/�������'�uUu��:o~w�����Hsd�tCB�|�6�����|����6rJV�6T@.
w�U8x0n�^v@vW���e�sW�Uw�U���[�?���+�<_�c��.��T�aS
	�����5Yke4^Y@F	&]U����h����w����Z��k�W����,$Q��HR����|�
��~!@F	*��7n���������e�d�+��+O�����.sP:!�$m,���"Y��{<�KL�H[����p������2�d�~��
�|�k�W�Y�nH�q���1�������! �D��4��%n��p+�}B�wZ]bF�l������7j_c�B��=a������9��`�������c���w���a�8����W��~�-wY]*r����Nw�t���� $����4��E�������s����������~���h�M0#�
	W/�k��|m{,��0�,3���"�mq�yk^#,L�o������:B��c�~������U��.9f�!��5�Z���0[,�@�(r�Gq����=�%���8���{�[���v�1c�B�W��
6n��T��+7"fHx�'��y�����0���r���|AvW���e�c��j8����c�����u�~��;47���._b����#$���YUx���h`���U%�����8
��'���fx����e�+@��3	W/�[}�F9�|>9�Ny<�I[WW'�����Z���9%�{RE=�q�6�[��V��Q���p�#r<���� ����t���z�SBB�������B^�W����>�����fY��M#|9n��p+�}���2������������m��� �!a�G#:�1��y��sOV��@@�@���P($��7����P(d�m��1�H�i8��7�9�;��0cL�_a���Z���_!��nH��o��5y�X�/��yV�fYV�����X~�?�x"�,]L�q����H�F�l��=7�i����W�	M5$t>O�dmd�i������������)��=�����:�����H[��'��V8�V�����p�K�0������#��I�BB$������F566�_�l���0��p�^EzO��_^P�O�?auya��

��p���oWk���X]*2P:!�$m,�����wEmm�JJJ�.����z����o��W�G~[EV�&��p��!���;Z���T�q��� ��n\c��yr��Y}�`9��m�V8xP�K����]U����zz�.�r�W�t�-=�A9��+47�$�BOXo��)!!fEN��PH~�?�s�N'�O`D>����{��m~I��R0hu���h�BI���^�~MEM/�_!�w�'���:�1�+7	1{r"0L�;���`�E��4��l����V����.�r��������������K��Y]*,�nH�z�]��k�c������S��z�]�,2��F��H���q[���s����V�h���+l8���?[���vX]*f�TC��k��|	!!�ON�^�W���?P>�O>�OMMM�x<�y�i8�����B�;e���'������������
��+7"fHx�'��y���
9&��������t���N�G���V��o8���-�����s������B�$�_��+�����`H7$\������GH�Y3gCCuu�������D`� |�>aXh/;0����P@/��mo]�;f�W���o�����.3h*!�1�p�2����9f��N����T�*��;�*<7n_� �����,s��U��C��4����_���}Vv}��R0��Ft�cDo��)!!������/P8x0qX���}�V�g�����;~����_a�������Y]*���}4���I�GH�L4�C�����Ib92LA8xP���q�vW��e�.�}C���������	��:��in�#�
	�m��&O���~8�����D`���TQQ��9���V�
Y)�������q��-�����D2�����A�o��� �M5$t>O��D�6���%%%jll������:��6��m��CQ�6�[y�R���g���V}�������������j���`
Nw���n$�s	��r����x���:��JJJTRRbu����C
w�����}��9&�_��gk5��K@�		%i�c����j9��u:��z�V��k8���-���E��;��>!��mu��*��
�mZ����
�L�!��5FH�'�|���LIN���5��l|X(���a�����WXV\�%?��~�Y�BOXo��)!!�����ZZZ�������q��#�����Z9�so��+��W�O~7n/; �����f�d������_����du������Nw�t���� $�������E555	���~��~������Q���V�/|�U�����+_�sa�d�~��
-g�B���nH�z�]��k�c����r&0<z��z�^y���N�~�_G����WMM��N�*++�.2V8xP���q�vW����.o�L�_�W����@DE��;��^gu��1��p��|-_BH��%g���:IRsss����rdcbSS�!�#���p���q��-{����5��W���k����W�k/+o���r��+7"fHx�'��y�����}>���*++']j\]]m�qTRRbu��Q"�mi�7ns���>auy�&��
W�7���z��d�tC��Kl���}���9�
��^/�!$2
�C���N�=,����Q��W����ZX���R���}����iZ!�1�p�2���d��	�`L��O�V��������z�����������
���2�w#:{y��~�D���$
%u!!���	=�ZZZTYY��$V(RKK�<��^���@�����������yM6����f\G(������
�3�+\�_EM/+u����)�w#:�1���Ft�cX����#$����$566���B:r�H�0����������.2F�s�"�'���ed{�KV�7�|����o��
W,_�%M���`�����q��i��<m,����<�o�J9655���D�@@r:�Q3
���{�F��]��:�H�-� �/�+<7n_����*���q��W���������. �M5$t>O��L�X&g~�|>_���P(n-<���W����*�W5X]��Jf��������K�W0�Nw���nry
����v=�z>!!0�r�'����� �DzO*��7n�V�Y��V�7������w��^�����Y].@NJ'$�����kI��������TEE9m#g~�h`����i��/���n�=��V�7�������e�-wY].@NI7$��&����yr�������V���3�! 9��5��EE�w*�}B�wZ]��Iv��EU�hn0M.������N9$0{r60���
�B>�Y��������saa��.��-��gu�Y�BOX�;�u�cXWn�&����:�|>����OGds�H�E������'ds��.oFL�_����W_��<9���
6n��\���nH�z�]��k�c���@�������NMMMV�+��7aXh/;��a�d�~�L���}����������r��TC��k��|	!!�ir"0�Bjii�$UWW���V%%%V�#|�^����q���]UV�7#&��pC������*��kU����� WnD���BO8��	������_���Qss���@F	*|���q��J��/X]���l��������nh��r<����d�tC��Kl���}��@�������B�T�so����/�^v����]2�n�����4����hS		��������
i����
�����	����)|�>n��p+o�kV�7�����t�@K^�����nD�;F����sX��^�W---�_
G��H���q���<�	)?�����?�w�%<f�W�d�j}�e�-wY].@F1B���Ft�c8��	�����$577k������Pmm�����.	�1R�co\X�|��k�TX�7t[�m?�c��	����pQ�>��|&���1���k���,_�����
3$+���
�|���B��jjjTSS3�u"��[�@Vi�m�"�m���N��O��p[]��	���������.�����bu���jH��|�6���@��I��XJ��j������v����'������U�q���X�t��g!����&7q����������Q�Ph�����p�`�������*���6��W��F=�M���NH(I��'$���������(�K�&]U9&�_�<�:}�e�+sN�!��5FH�'�|��� de`�'<�����q��J��V�7-���p���ha�>���5z�z��O		L��	'j�2��)�����JUVVZ}��H���{��m����.oZ$�_��kCr<���o�au�3�BOX�;�u�cXWn�~9�#
����:z�����U]]muI��H�F�����n��OH�N�K��d�+t�[�%�d�B���
	W/�k��|m{,��@�r&0lmmUMM�ZZZTYY���jy�^I��������Umm�$���EG�UMM�JJJ�s �
�F����P�N�=��aa����.�����Wr�TC��k��|	!!���L`���b��G��:�t:U]]���j������I���������USS����t��QC�o��0�}B�����pJ&�����n���[��m�U�#,9����^�	'}!!��d�D"��5E{��G���+�s��5�PH����x<:����fSII�������gaa��-�P�c'd+�lu��
����QQQ�JKK��M�_���a���i��!-�����b��Y���K}}}Z�l�\.���9���K�JKKUTTdu9�2�����������dH��0�|v�+�.�@@^�w��P�m��x�CII����U]]�l6���_B�������#��f����X555
�B�zN6�d�����a���@V���~�����?L>����Q�`���]�B�������{�j��
h�kw��;CI���������t����������\X`����dII~�a����ZZZ���4����***���N[ZZ������,��3�s��< ��;�*<7n_� ��������~��o���d������[���LBc���e93�@��O���@ 0ipf���4���B�PVJG�U]]]R���s�9���W���Www�����L�9�t�-�����*�W�`uyi���}���0,\4�������n�`�f-i�OX�N����|XxlP������Lk&�����59�ict=������;����|>�����Zuu�����;w����jxb,���sgR�#}>��~��N��9b.�.))Qkk�JJJ����^�t�����lT���q�vW��e�./m�������b��&����p�=�yKv}EE�|��& k$
	OwOz�c�M����_�OH�R9��SYY���fI���***d���***���"Ijnn6����3l2B�L����v�Zs�d2uaYeee�>�Fi����9�t�
"��H���q�6�;�����������:�I��Y].��������X���.���<�o��S{VWW���DG�5�A���Tee�jkkURRb���~s|��)������okkk��z']�m��8���2�c��2�s��< �E��4��l����V����������Uq������
mK�@V8�1���Ft�cD�w#I���o��5yr>O�����rD�}2y�^y�^577�3�<��a`$��z&�x<jmmMi	��f �]SR���t�����L�o�H�i8z�[���aa~v��X}C�����.A^4�W���������LK���d�tBBI��X>!!�����R��'a*�6lI�d������M�k8�����B�;e�pV�������V�%����U���U��4�9���r��n\c��yr��Y}���C���,5YN�3+�"�����k��9�t^"��]�����\0�Gnj���u�����y�uc��i�R�/hu�)���w�������c��z��5-kd���o�?R_0���t��U���
����k�������20M>���w.��G���P�a�������S����Z8o4\��.�Y}C9��9d�x@V�1���~�***R:�����������B�����#�ku�?�}#�q��,���7~ER��%������3}����_a�p�>�����eROv��Mn�����oO�B&
��~X�r�|��8��S�����>�|Y�J������*�oD���+e�,~����a&�z��.�@:�M��g�������Z�$=��*L^+�}�h���.0�G��������Lc�+������������gu�@��s�����5o�<��?��r��e�E����?'��j��'��B�?.����w(���SyVhC�-��lt���3'�3�l�
����0	^�w��%>�O>�OMMM�x<jll����������79�r�d\.��\sS�s�������W��e����Y]`
:B��S��t��O�����
47fXWW����T\\��]�2�Hr�\**������.��u�cX�;�u�F�{�^f��5���&_���$-��V�$~�����a2���N�Suuu�x<i5�V^�W>�o�!��mII�������LT8x0n����}U�����Xw��7'<6v���M�������Yu�F�	/���>/>$��6gCCuu�������4�C#,�9�1>v�n:�d�y@&*��7n������������|��u'������p���[7���j�5�^����������	�Iv��mN�S�R����������	�����H��7�s��<�j���	�B���U3��n�+���a�������n(o�KK��+��7�.��+7":�����g������PRa��%6�z�>�g�:���z�>�Bs���[��������x
�TQQak�PH555
*))Quu������+E��4�����m���'�|��%&�#����UG(��u�`X�9���;o)u����@����.���������C�=O���es�We�2��$��~���H��Z�l8r����]+�����by<�B!9�N9rdZ����+D��4��E��������YN�_��kCj8��
�5�9���r@������}4���I��|�����p���'�����f��#P����Z���u%%%���Vuu��N��ZTWW����	��K��l:�u�!�;�&��'d+|��
��p�y�����[���-���K��`Z3	����~�?���=�TX��o��������3�&a�D"����P>�OI=���D���sr�!�
�4��E����Cy����p[]����nk��o%\�,��W�t�-�}�e�s��{L0TOO����TZZj�-9���K}}}Z�l�\.���9�����R��k��;��1���k���|�6��S�r?s�����'���Qkk���+))1;��l�x6aXh/;�aaG(�}������q�
��p���6���e*����[N0��t��g!����&7����.'>=�N'pd�p�^E>�Q�������*���T*�.��'��au� K�J����		`��I
3,|�^����q��*+������;~��������3�������U��.d�tC��k��0O��6�orF���PHMMM��|�����z��xT[[+�3;:��~��A�/�7nwU�^v���&�7t[/��m��������
��W���V�������z�SBB�@9��~UTT(
���|��|:z���9B�\3.<�p���q[���;B�����u5����
���T���������@�����a��������r&0�B��s�B��JJJTYY)��c�&��|jiiQ PEE�����i`�D�������s������V}��P��&c�+��^��o��~�`B�������qM��=�OH�,gC#4:&���^�W���������WKK�jkk�.@���i�mK����V������Y��~���������.d������k�BB�J��G��$566�;s��t���Q��|���ph4,��!�)��&�_�$9�I�����d�a����!���p��@����������D^�w��y�^9�N�|>�K�k&��'ds���0�d�+�9Z����
�)��p���?v!!d��	%���$��9����Q m�������Cyk^���0��
�W����/+o�������3	W/�[}�	�L`XRR"���P(4a3�@ �@ �t��_�O�������./�T�+\T���&�a�w#:�1���������	�^�ZZZTWW����qWWWg>�C�s����q��U
����./�d�~��
U��O��`�W��j��%!a�G#:�1��y����r&0���VKK��-������06!"�BBIDAT
������E>�ON���'�E�b}���U%���./�d��t���������*������,J7$t��i��<m,����<�o0E9z<9rD;w�������Icc#K�LY8xP�K����]U�����8��WX���e���_]&�7�in�1�����<m\�3�Z�C��$UVV���������N��^�jkk��x�.@��|�#�;������'�����nh�`X���T������������F>	G�;��07��'���Qkk�B���~�JJJ�M`ZE��4��l����V�����Ea�Br����n
D���]�������)=����		`��O���������R%%%��B�N��6��m��C����aa�3���d�+�$��/i��V���q�FDWB��
O�u]	'=p�l\c��yr��Y��fQN�~�_---�F�%��)���������phI�~�+ E�����Kt��}*�[������~ycL
����
��AH�r$0������c3b8���-�|=�YXhs������~������i?�����u]	G}}�����~�F���$}j�����+'C��#���P($���Rd�n�m�"�mq��U
�
�����'���%<>v����vhQ�>�B@��,����U�������u�q#,]n'$������t���U��s�jkkU]]-�3s��^�����������.O��~�/����#�;�_���}Z��+V�@\#�G�_���o���;��|~�`��D`(Iuuu*))���W]]����&='��=DX+��W����q��2&,�}��^>����+t|�^7[]2 G��%z|��
�~u�M��z�����K�|��k�?@�������Y]�L��d_�`uy���w�X�m?��X�~��o����BOX�"Q_gZ�^+8��T�����@/Q���������***�����	[[[�.@	*��7n����d/;`uy�����:����o�~�7���z�+�����7"�r#u��/>�s��E-�]Th�[���_Z}�L��	it`�DzO&n��y����7t[�N}+��
���V���d�8&�R���;��E}�y>��t~���	c��~�B�	C� V��M#|9n��p+�}B�����D�Mb�+\T�O�����^�u�v�%��gl�����:9���������'�x��+�����-�p�_6�;e�pF���N}+as���w���kZ4��������WIH5��K�zW/�kQ�-����t� 7�L`XWW���&�����C
w�M��O�V�����nU������������-wu��4�g�Bs
�z���K����a(RKK�$���Z���*))��,�b8���-�����s�����������;~�����&�������,��1Y�^�Q����E�'��;��@�r"04�+�x<jnn���)#�&�e,
��n����vB^4�W������$I��i���XV+��m���N������L����Y�R����'?�����UeY]�uBn8��V]������*����Zd�T;����z�2_:��\���:��6$�]������,
;B�����u5���kCj8��
������o�Ls~�P�n�Z�[��T0������k����`T��^�W---�_ )��A�/�7nwU�^v����]m�K�~;a'�
�;��_�j�`X���T���in������h�g���u��I�!R���$577k������Pmm�����.	@�
*��7n�V����0�N�472��F��e��?:�d��+**����B���QMM����D��R�D�������s��j8�����$}�L������K��m�eu�,���������;*�S/�@������o�H���q���<�	)��A�����O����;�h0��N^�c���9*�����^g�kd��:�������#r���y�
T0� .��S/bee`����P(du��ph4,����w����ea�x������wM��
)u�����	r�d��������{����|c���.���i��er�[};�Yz<�K�M&��'ds�g���P@�N}+as����in�\�������������/���N��ZV���|>9�������:9�N���Z]6���YX�o�;���5K�B�������K>�yK_=wC��Z��+ZX����
�nz�f8x�c��rR�y�N��e9VTT������u��655sD�b}���^v@��4��|����n����*�����!��&�~�w#:�1����?���.��S/�pOV��@@�@��_�Ph����8{ �/��W����q���]U�^O2��m��4�g�Bd��\f<Y�G�^`�de`(��(�����'�>�@n_z5qX���}��ZK��m����u�j{��E�a5��lnR������e���b:��^f���<? �de`XRR���Z555�u���Qss���`����X7nwU�^v`Vk	������v�N���
��gz����
6n���47AF��e���6m\���ey�<���@f���P����h~m�����@��|�#�;�����Y���Lsd��Xf�~8O����3	d��
c��������2X(�����g��m���'f��c���n��q;!�L�l�=���&�(�2c#$L��e�����jsF��_������H�F��H�1
����aa�s�j�n�!}���	���in�Lq�FD�G���Q�]�s]������������|���d+��x�4�%��G�B
�(Y�����<��
�t�p�n�3�
NJZ��&��&����lIb4�.����H���rw��p17h����H�q� �W�hlG�g��?�s"y~hF��h�����#�����}��[���`	KJ��sA���*!����������gU��I�C���BT������%��2c`�"0P���R��d�q|z��a���2/��<��� kq����������vb^�r��X�T��K��Lf�;7����p����5g%����Z�HR�������v����2���k��g�1�T�Z��N�c�2��M��l�5/}�W	�ytB{��k��r��W���T�G�E����n��V���D`�*��;���hw^{������h�@O�mV%����Zu��7A���@��:����aa�.97���>���|��g���q�����e��o�,3P.��Jz�W����v����i��?>yN���d�J�w�`L���W��[��gO��Xb���J����B�e��G`�j�c������l~���/����vk���*��P��p�Zn<,30W����T��[3����<+���z�!#�=?|T�g_�������}����M��	
V�e�-o�,30W�<���J
n����
�n9�?P�����_��>������u�����Mr}z/�M�W)��������k�v5!��"0��%
�_��5,�iyV�+�����*!��d\�?F�Zn�����	�b�1�jC``�JJ
n����T������^~�@O��&�t�sc�y�,�M��e���!�+5tg�������a���9=<�UE_;��m����:���	�������T�1��Xf`�!0� ��;e�z:����_��]e�n�J�kI��������K�����X���
<�R���/�L��,3�PXp���*=����l�U��p��k���Z����Lq����s�{Yf`1#0���G{�~���vg�.9������������<|V�?F�������7Y�e����R�x�e��CFz�W����vG�Me
�v�=���Y�Y��W��MW��]�G�2*�2����5�J��
�!��L*}loF�������*�u��Lq���T���p�e�Cg&��&%�i��+����Y��]�k�O���>��_1c��i���)m��\���;�W�H�r�q�u5���a�1�E��@e%��a�j�r^�,a��������f�������~nLM
���a��m������G`�r���5-���j)�%��Lq��e-/~�dZ����:��,u�*����Lfl���DY��o�D�o�'���������(nRe����I�@������Y�Bg�~9��H�����O�o���������=nj%�M��������p������
��l�U�������BZ�?F�?�����7Y�J���Z^�2c��!�y�~���aa�.9������<��� kq�rS�G���q�W���j����5���e�P,C�&=��������]r6�/�������WsVB�=��~�CZ�k7a�P�e���j�����!00/�SO+=����p���a_I����/j��Z	�ytB{����|��&���z����z��J��x�f�9�R�;,eg&��3���jQM��R��d�����/\���i���*�X��gL��_Zf���U:�X�����B���`�����.����W4�8�0uuui���r8jhhPGG���{�j9�c&��&%/�P�.yX���O��znL�>v�V�',�'������M������=q^��:x4�S����iYW�;>�L�?y����������\KXe��%&�����o��z��������#��[�������/�,{X8>yN���d�J�w�`L�}�V?Aq�r��[�"�
�@������$I###�x<����x<.�����>��n��q���+����CU{�$i(5�M�����:\-%�L!�������u��J?�Eg�2��C)%&��f�v�C�u,3���a�f����*E�Q��~��n���}�������f�����V��qE"�|��;��:�M����vg�~9w��CF\���@��&NOj�wN�����;o���X4%�����f|�uu��3z�;��n����,z��������f�566V�;��e�`~���Z�jU��,z�9�����p	��T�P�������p��v�$I���Uy�#=�Y��0��}�{�������������������i8rQ�>5������yA�\,*,�����[�����:��J��zNX�oz��b�o�K������Omr-_��u����8�^z�S����v����,,�W	��'������b�j75U�qT�R/3���F�z�@5!0\B�P��v+���_�XL^�W>�O�@ cv�L!��~ye�j9����6��s���\c�@O��&��J��q�oiU�a������n��V�P��+0������F�F
����7�3���<`���P:��^��;3�WD���s>���9=<�UE_;��m����:�-����&x�dzZH8V8���F-�j*}K�"0\"�����ti�nww����d����
�����
)�5[q�����C��BW�j6?1����}]]?xD�����mkI���V���L��o���X��e������q��#0\B����Pww�ZA����J��PH�������7::����W��|�������E���K'�����OK:=��??�}���\j"c����z�{g������M��^y���cA8?������W�N�����s���}�z���6_saZ����I'�����K�/��?�W����o����n�`ttT���z��,�9T���F�X�������p��f��v�����P(����Y����W-�]����/������������fbZ{��R����<+I�F����g~yH����m7�UG|��n�K�o����_�|�^/�tit�^�����:�;VMh�����_;���6��t9o2�\�c�O����������E0�/���5�\S�.���6��g/M.����X�r\>n�{��D����_i�k_P�eaa��m��o<�����t�o�<�8�����~2�O����\��\Y�GQQ�{y����z�J_h�jWZ�Z����^�������LI5�\o�*�D"�s��i����_��a������+�� P6'O^��v��|��JwX�s�f��5K`��|>���h��*���x��������Jv���4��C�/el���������t�}=������������r��J?��IL�:<���#�:q�����v���k�vue�C8::j��������?^.\���b�ed�W]u�V�ZU���c(/�%���C�pX>�O�H� &�J��*�V�fn���/_�[-�avRCw�Lf�;���1��p|���<����W�������SK��Ib���#������P�P�����*YAY,�:�0��aIR[[[�q���Y��Z�L
��8/=�)�������r6���y���>�/kX�����>����$���������>��:p�b��p�j�v4����,�3�B�?y����2�B@�����6y<�����i��a���2C^�W�@����z��c����x<.��3��j:�I���hoF��q����C����elk���~�6����T�����`^�h���w���w��=7/����r�Wf�1���$y�p��������W,SCC��^��n�b��f[��������i���q�a�7�j9�I��*��#���]r6���y�9������n>�{Wm���v���^����G&u�hr�}w4�j��e{!`�a����z522�`0(���X,�h4*���`0�H$��b���������0J�f�
�|/`����G{���hw4�4���������w�u��%�8c���{A{�|���U���-u�lB�B@)9L�,��&�%�L*5�MJN7������g�Zw���<��b_�7����V^H��/N���w,��:��U����Z��RW���GGGu��I�Z�J7n�tw�E��W^������YC�d���Vo������<`�eeI2��r��W��SX���{����"c���K�M��$�W8x<��/$uxhi���@` ���5,T�[���*,2����>�����mNO�K?{�����]�K������.j�x*�~kW;�u3A!`~��',�iyVWK��|���u�[	s2c�������������	
wnY���
�|7
`�7�B31���f��
g������;o�������IN����&}lo�����_��?R���2���w#������W������7U����������3�kK�;`Kw*=������O��]E�k|����#��3?����BZ�SiG�/eq�B���u5��}ujYWS�.`#0 IJ��H���q����[���'�������.�2c��DR���R7��Q��01a��P��P�(=��������]r6�/�\CF\����SB3�m8=�G�7����*}�%��0u��E}�hR�	�B@�#0�8���Jwf�;\-rn�W������|Y	gfq�����S���kv�q�o�$�	
w4�j{s�6�qV�����X����RCwf�;\-�iyV�u|�o��?������%k���?�E-k����<g��;�,����Jw��K��Tjp��4�o�u>t(�gN�8��O��L�������3�������%(T3C`)JJ��s�a���9��@�d�������^�|���.nRhP��wh{s�v4���!��$
���|����7�B�������}]��V�^�Idl�pzR�1�#�}Q����b���[���'(T?C`�I�t���`F�s���������O=��5��M6���__����;>Z�[����)|!��C�����XB���2�e�;�������s|���Q_y��t�6����#t�������������E
O��o�j��n&(,^����Tz�7��y����_{.�'~�?�������t���E5k+}�E)&(��e�v4��&`q�;_`	H��fw��a_A���ot�[�e�v��W���PU79x4���I�B.�w��"��Uz�3��q�G�l�?�����������MV^H�/S7���=����<���#�:q���A!`��;a`3���BW�j6?1��C?I����2�m8=����&��>^��,H�Aa�����:����t��C`�2�J
n�hw�ZT���T��{�?��#�~�Q�����_�����77W�6�JL�:<�"(���"d�q�RX�4�o�u�y�������
��?��
pmem-simple-v6.patch (text/x-patch; charset=UTF-8)
diff --git a/configure b/configure
index dd64692345..91226ee880 100755
--- a/configure
+++ b/configure
@@ -867,6 +867,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 '
@@ -1572,6 +1573,7 @@ Optional Packages:
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
   --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
+  --with-nvwal            use non-volatile WAL buffer (on a PMEM device)
 
 Some influential environment variables:
   CC          C compiler command
@@ -8598,6 +8600,203 @@ else
 
 fi
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -z "$GREP"; then
+  ac_path_GREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in grep ggrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+  # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'GREP' >> "conftest.nl"
+    "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_GREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_GREP="$ac_path_GREP"
+      ac_path_GREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_GREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_GREP"; then
+    as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+   then ac_cv_path_EGREP="$GREP -E"
+   else
+     if test -z "$EGREP"; then
+  ac_path_EGREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in egrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+  # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'EGREP' >> "conftest.nl"
+    "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_EGREP="$ac_path_EGREP"
+      ac_path_EGREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_EGREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_EGREP"; then
+    as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_EGREP=$EGREP
+fi
+
+   fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#if __ELF__
+  yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "yes" >/dev/null 2>&1; then :
+  ELF_SYS=true
+else
+  if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
 
 
 
@@ -12962,6 +13161,57 @@ fi
 fi
 
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
 
 ##
 ## Header files
@@ -13641,6 +13891,18 @@ fi
 
 done
 
+
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
 fi
 
 if test "$PORTNAME" = "win32" ; then
diff --git a/configure.ac b/configure.ac
index 748fb50236..460a227fe7 100644
--- a/configure.ac
+++ b/configure.ac
@@ -999,6 +999,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
               [do not use Zlib])
 AC_SUBST(with_zlib)
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+              [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+  yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
 #
 # Assignments
 #
@@ -1303,6 +1335,12 @@ elif test "$with_uuid" = ossp ; then
 fi
 AC_SUBST(UUID_LIBS)
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [],
+               [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 
 ##
 ## Header files
@@ -1480,6 +1518,11 @@ elif test "$with_uuid" = ossp ; then
       [AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
 fi
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 if test "$PORTNAME" = "win32" ; then
    AC_CHECK_HEADERS(crtdefs.h)
 fi
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 13f1d8c3dc..79a479fcf5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -21,6 +21,11 @@
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
+#include <libpmem.h>
+
+#include <stdint.h>	/* for uint64 definition */
+#include <stdlib.h>	/* for exit() definition */
+#include <time.h>	/* for clock_gettime */
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
@@ -43,6 +48,7 @@
 #include "commands/progress.h"
 #include "commands/tablespace.h"
 #include "common/controldata_utils.h"
+#include "common/file_perm.h"
 #include "executor/instrument.h"
 #include "miscadmin.h"
 #include "pg_trace.h"
@@ -795,7 +801,7 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  * write the XLOG, and so will normally refer to the active segment.
  * Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
  */
-static int	openLogFile = -1;
+static void *openLogFile = NULL;
 static XLogSegNo openLogSegNo = 0;
 
 /*
@@ -970,6 +976,189 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+
+#define PMEM_DEBUG
+
+static int64
+time_delta(struct timespec *start, struct timespec *end)
+{
+	return (int64) (end->tv_sec - start->tv_sec) * 1000000000L +
+		(end->tv_nsec - start->tv_nsec);
+}
+
+typedef struct {
+
+	int64	n_total;
+	int64	n_mmap;
+	int64	n_munmap;
+	int64	n_memcpy;
+	int64	n_persist;
+
+	int64   t_total;
+	int64   t_mmap;
+	int64   t_munmap;
+	int64   t_memcpy;
+	int64   t_persist;
+
+	int64	l_memcpy;
+	int64	l_persist;
+
+} pmem_stats;
+
+static pmem_stats stats;
+static bool stats_initialized = false;
+
+static inline void
+init_stats(void)
+{
+	if (!stats_initialized)
+		memset(&stats, 0, sizeof(pmem_stats));
+	stats_initialized = true;
+}
+
+static inline void
+print_stats(void)
+{
+	if (stats.n_total >= 1000000)
+	{
+		elog(LOG, "PMEM STATS COUNT total %ld map %ld unmap %ld memcpy %ld persist %ld TIME total %ld map %ld unmap %ld memcpy %ld persist %ld LENGTH memcpy %ld persist %ld",
+			stats.n_total, stats.n_mmap, stats.n_munmap, stats.n_memcpy, stats.n_persist,
+			stats.t_total, stats.t_mmap, stats.t_munmap, stats.t_memcpy, stats.t_persist,
+			stats.l_memcpy, stats.l_persist);
+
+		memset(&stats, 0, sizeof(pmem_stats));
+	}
+}
+
+static void *
+pg_pmem_memcpy_nodrain(void *dst, void *src, size_t len)
+{
+	void *ret;
+
+#ifdef PMEM_DEBUG
+	struct timespec start, end;
+	int64 delta;
+
+	init_stats();
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+#endif
+	ret = pmem_memcpy_nodrain(dst, src, len);
+
+#ifdef PMEM_DEBUG
+
+	clock_gettime(CLOCK_MONOTONIC, &end);
+
+	delta = time_delta(&start, &end);
+
+	stats.n_total += 1;
+	stats.n_memcpy += 1;
+	stats.t_memcpy += delta;
+	stats.l_memcpy += len;
+
+	print_stats();
+#endif
+	return ret;
+}
+
+static void *
+pg_pmem_map_file(char *path, size_t len, int flags, mode_t mode, size_t *map_len, int *is_pmem)
+{
+	void *ret;
+#ifdef PMEM_DEBUG
+
+	struct timespec start, end;
+	int64 delta;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+#endif
+	ret = pmem_map_file(path, len, flags, mode, map_len, is_pmem);
+#ifdef PMEM_DEBUG
+
+	clock_gettime(CLOCK_MONOTONIC, &end);
+
+	delta = time_delta(&start, &end);
+
+	stats.n_total += 1;
+	stats.n_mmap += 1;
+	stats.t_mmap += delta;
+
+	print_stats();
+
+#endif
+	return ret;
+}
+
+static int
+pg_pmem_unmap(void *addr, size_t len)
+{
+	int ret;
+#ifdef PMEM_DEBUG
+
+	struct timespec start, end;
+	int64 delta;
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+
+#endif
+	ret = pmem_unmap(addr, len);
+#ifdef PMEM_DEBUG
+
+	clock_gettime(CLOCK_MONOTONIC, &end);
+
+	delta = time_delta(&start, &end);
+
+	stats.n_total += 1;
+	stats.n_munmap += 1;
+	stats.t_munmap += delta;
+
+	print_stats();
+
+#endif
+	return ret;
+}
+
+static void
+pg_pmem_persist(const char *msg, void *addr, size_t from, size_t to)
+{
+	size_t len = (to - from);
+
+#ifdef PMEM_DEBUG
+
+	struct timespec start, end;
+	int64 delta;
+
+	if ((from < 0) || (from > wal_segment_size))
+		elog(WARNING, "bogus from value %ld", from);
+
+	if ((to < 0) || (to > wal_segment_size) || (to < from))
+		elog(WARNING, "bogus to size %ld", to);
+
+	if ((len <= 0) || (len > wal_segment_size))
+		elog(WARNING, "bogus persist len %ld", len);
+
+	clock_gettime(CLOCK_MONOTONIC, &start);
+
+#endif
+
+	pmem_persist((char *) addr + from, len);
+
+#ifdef PMEM_DEBUG
+
+	clock_gettime(CLOCK_MONOTONIC, &end);
+
+	delta = time_delta(&start, &end);
+
+	stats.n_total += 1;
+	stats.n_persist += 1;
+	stats.t_persist += delta;
+	stats.l_persist += len;
+
+	print_stats();
+
+#endif
+}
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -2478,7 +2667,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
@@ -2490,7 +2679,7 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
@@ -2536,7 +2725,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			{
 				errno = 0;
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+
+				// written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				pg_pmem_memcpy_nodrain((char *) openLogFile + startoffset, from, nleft);
+				written = nleft;
+
 				pgstat_report_wait_end();
 				if (written <= 0)
 				{
@@ -2637,11 +2830,11 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if (openLogFile != NULL &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
@@ -3070,7 +3263,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3250,7 +3443,7 @@ XLogNeedsFlush(XLogRecPtr record)
  * take down the system on failure).  They will promote to PANIC if we are
  * in a critical section.
  */
-int
+void *
 XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 {
 	char		path[MAXPGPATH];
@@ -3258,10 +3451,13 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	PGAlignedXLogBlock zbuffer;
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
-	int			fd;
 	int			nbytes;
 	int			save_errno;
 
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
+
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
 	/*
@@ -3269,8 +3465,15 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		/*
+		 * Map an existing file.  The second argument (len) should be zero,
+		 * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+		 * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+		 *
+		 * FIXME maybe check the length and is_pmem flags here?
+		 */
+		addr = pg_pmem_map_file(path, 0, 0, 0, &map_len, &is_pmem);
+		if (!addr)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
@@ -3278,7 +3481,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
 		else
-			return fd;
+			return addr;
 	}
 
 	/*
@@ -3293,13 +3496,17 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 
 	unlink(tmppath);
 
-	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	addr = pg_pmem_map_file(tmppath, wal_segment_size,
+						 PMEM_FILE_CREATE | PMEM_FILE_EXCL,
+						 pg_file_create_mode, &map_len, &is_pmem);
+
+	if (!addr)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
 
+	/* FIXME check size too */
+
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
@@ -3318,7 +3525,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 		{
 			errno = 0;
-			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+			if (pg_pmem_memcpy_nodrain((char *) addr + nbytes, zbuffer.data, XLOG_BLCKSZ) != ((char *) addr + nbytes))
 			{
 				/* if write didn't set errno, assume no disk space */
 				save_errno = errno ? errno : ENOSPC;
@@ -3333,7 +3540,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 * enough.
 		 */
 		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (pg_pmem_memcpy_nodrain((char *) addr + wal_segment_size - 1, zbuffer.data, 1) != ((char *) addr + wal_segment_size - 1))
 		{
 			/* if write didn't set errno, assume no disk space */
 			save_errno = errno ? errno : ENOSPC;
@@ -3348,7 +3555,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 */
 		unlink(tmppath);
 
-		close(fd);
+		pg_pmem_unmap(addr, wal_segment_size);
 
 		errno = save_errno;
 
@@ -3358,11 +3565,15 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+
+	pg_pmem_persist("XLogFileInit", addr, 0, wal_segment_size);
+
+	if (false)
 	{
 		int			save_errno = errno;
 
-		close(fd);
+		pg_pmem_unmap(addr, wal_segment_size);
+
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3370,7 +3581,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd) != 0)
+	if (pg_pmem_unmap(addr, wal_segment_size) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3411,15 +3622,17 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	addr = pg_pmem_map_file(path, 0, 0, 0, &map_len, &is_pmem);
+	if (!addr)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
 
+	/* FIXME size */
+
 	elog(DEBUG2, "done creating and filling new WAL file");
 
-	return fd;
+	return addr;
 }
 
 /*
@@ -3642,21 +3855,28 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 /*
  * Open a pre-existing logfile segment for writing.
  */
-int
+void *
 XLogFileOpen(XLogSegNo segno)
 {
 	char		path[MAXPGPATH];
-	int			fd;
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	addr = pg_pmem_map_file(path, 0, 0, 0, &map_len, &is_pmem);
+	if (!addr)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
 
-	return fd;
+	if (map_len != wal_segment_size)
+		elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+					"expected %zu; actual %zu",
+			 path, (size_t) wal_segment_size, map_len);
+
+	return addr;
 }
 
 /*
@@ -3852,7 +4072,7 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3861,11 +4081,11 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
-		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
+	// if (!XLogIsNeeded())
+	//	(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile) != 0)
+	if (pg_pmem_unmap(openLogFile, wal_segment_size) < 0)
 	{
 		char		xlogfname[MAXFNAMELEN];
 		int			save_errno = errno;
@@ -3877,7 +4097,7 @@ XLogFileClose(void)
 				 errmsg("could not close file \"%s\": %m", xlogfname)));
 	}
 
-	openLogFile = -1;
+	openLogFile = NULL;
 	ReleaseExternalFD();
 }
 
@@ -3895,7 +4115,7 @@ static void
 PreallocXlogFiles(XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
-	int			lf;
+	void	   *lf;
 	bool		use_existent;
 	uint64		offset;
 
@@ -3906,7 +4126,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 		_logSegNo++;
 		use_existent = true;
 		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		pg_pmem_unmap(lf, wal_segment_size);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -5308,7 +5528,7 @@ BootStrapXLOG(void)
 	/* Write the first page with the initial record */
 	errno = 0;
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	if (pg_pmem_memcpy_nodrain(openLogFile, page, XLOG_BLCKSZ) != openLogFile)
 	{
 		/* if write didn't set errno, assume problem is no disk space */
 		if (errno == 0)
@@ -5320,18 +5540,20 @@ BootStrapXLOG(void)
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	pg_pmem_persist("BootStrapXLOG", openLogFile, 0, wal_segment_size);
+
+	if (false)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile) != 0)
+	if (pg_pmem_unmap(openLogFile, wal_segment_size) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
-	openLogFile = -1;
+	openLogFile = NULL;
 
 	/* Now create pg_control */
 	InitControlFile(sysidentifier);
@@ -5605,11 +5827,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 * segment on the new timeline.
 		 */
 		bool		use_existent = true;
-		int			fd;
+		void	   *addr;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		addr = XLogFileInit(startLogSegNo, &use_existent, true);
 
-		if (close(fd) != 0)
+		if (pg_pmem_unmap(addr, wal_segment_size) != 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
 			int			save_errno = errno;
@@ -10373,10 +10595,12 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			pg_pmem_persist("assign_xlog_sync_method", openLogFile, 0, wal_segment_size);
+
+			if (false)
 			{
 				char		xlogfname[MAXFNAMELEN];
 				int			save_errno;
@@ -10405,37 +10629,21 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
  * 'segno' is for error reporting purposes.
  */
 void
-issue_xlog_fsync(int fd, XLogSegNo segno)
+issue_xlog_fsync(void *addr, XLogSegNo segno)
 {
 	char	   *msg = NULL;
 
+	/* XXX not sure if correct? */
+	size_t		from = (LogwrtResult.Flush % wal_segment_size);
+	size_t		to = (LogwrtResult.Write % wal_segment_size);
+
+	/* flush until the end of the segment */
+	if (to == 0)
+		to = wal_segment_size;
+
 	pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
-	switch (sync_method)
-	{
-		case SYNC_METHOD_FSYNC:
-			if (pg_fsync_no_writethrough(fd) != 0)
-				msg = _("could not fsync file \"%s\": %m");
-			break;
-#ifdef HAVE_FSYNC_WRITETHROUGH
-		case SYNC_METHOD_FSYNC_WRITETHROUGH:
-			if (pg_fsync_writethrough(fd) != 0)
-				msg = _("could not fsync write-through file \"%s\": %m");
-			break;
-#endif
-#ifdef HAVE_FDATASYNC
-		case SYNC_METHOD_FDATASYNC:
-			if (pg_fdatasync(fd) != 0)
-				msg = _("could not fdatasync file \"%s\": %m");
-			break;
-#endif
-		case SYNC_METHOD_OPEN:
-		case SYNC_METHOD_OPEN_DSYNC:
-			/* write synced it already */
-			break;
-		default:
-			elog(PANIC, "unrecognized wal_sync_method: %d", sync_method);
-			break;
-	}
+
+	pg_pmem_persist("issue_xlog_fsync", addr, from, to);
 
 	/* PANIC if failed to fsync */
 	if (msg)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 87c3ea450e..2836d82132 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -50,6 +50,7 @@
 #include "postgres.h"
 
 #include <unistd.h>
+#include <libpmem.h>
 
 #include "access/htup_details.h"
 #include "access/timeline.h"
@@ -100,7 +101,7 @@ WalReceiverFunctionsType *WalReceiverFunctions = NULL;
  * but for walreceiver to write the XLOG. recvFileTLI is the TimeLineID
  * corresponding the filename of recvFile.
  */
-static int	recvFile = -1;
+static void *recvFile = NULL;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
 
@@ -602,7 +603,7 @@ WalReceiverMain(void)
 
 			XLogWalRcvFlush(false);
 			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			if (close(recvFile) != 0)
+			if (pmem_unmap(recvFile, wal_segment_size) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -617,7 +618,7 @@ WalReceiverMain(void)
 			else
 				XLogArchiveNotify(xlogfname);
 		}
-		recvFile = -1;
+		recvFile = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -896,7 +897,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (pmem_unmap(recvFile, wal_segment_size) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -911,7 +912,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				else
 					XLogArchiveNotify(xlogfname);
 			}
-			recvFile = -1;
+			recvFile = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
@@ -931,7 +932,10 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* OK to write the logs */
 		errno = 0;
 
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		// byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+		pmem_memcpy_nodrain((char *) recvFile + startoff, buf, segbytes);
+		byteswritten = segbytes;
+
 		if (byteswritten <= 0)
 		{
 			char		xlogfname[MAXFNAMELEN];
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 221af87e71..d63e5522b4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -287,8 +287,8 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern void *XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
+extern void *XLogFileOpen(XLogSegNo segno);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -299,7 +299,7 @@ extern void xlog_redo(XLogReaderState *record);
 extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
-extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern void issue_xlog_fsync(void *addr, XLogSegNo segno);
 
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
#39Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Tomas Vondra (#38)
Re: [PoC] Non-volatile WAL buffer

On 26/11/2020 21:27, Tomas Vondra wrote:

Hi,

Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.

The numbers (5-minute pgbench runs on scale 500) look like this:

          master/btt    master/dax           ntt        simple
    -----------------------------------------------------------
      1         5469          7402          7977          6746
     16        48222         80869        107025         82343
     32        73974        158189        214718        158348
     64        85921        154540        225715        164248
     96       150602        221159        237008        217253

A chart illustrating these results is attached. The four columns are
showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
"ntt" is the patch submitted to this thread, and "simple" is the patch
I've hacked together.

As expected, the BTT case performs poorly (compared to the rest).

The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch does
outperform these cases by ~20-40% in some cases.

The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.

The page faults are only a problem when mmap() is used *without* DAX.

Takashi tried a patch earlier to mmap() WAL segments and insert WAL to
them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at
/messages/by-id/000001d5dff4$995ed180$cc1c7480$@hco.ntt.co.jp_1.
Could you test that patch too, please? Using your nomenclature, that
patch skips wal_buffers and does:

clients -> wal segments (PMEM DAX)

He got good results with that with DAX, but otherwise it performed
worse. And then we discussed why that might be, and the page fault
hypothesis was brought up.
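
To make that data path concrete, here is a minimal sketch of inserting a record directly into a mapped segment. It is only an illustration, not code from any of the patches: it uses plain POSIX mmap()/msync() as a stand-in (on a DAX mount the patches use the libpmem map/memcpy/persist calls instead), and the function names map_segment, insert_record and flush_range are made up for this example.

```c
#define _POSIX_C_SOURCE 200809L	/* for ftruncate(), pread(), msync() */

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map an existing, already-sized segment file read/write; NULL on failure. */
static void *
map_segment(const char *path, size_t seg_size)
{
	int			fd = open(path, O_RDWR);
	void	   *addr;

	if (fd < 0)
		return NULL;
	addr = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close() */
	return (addr == MAP_FAILED) ? NULL : addr;
}

/* Copy a record straight into the mapped segment, with no buffer in between. */
static void
insert_record(void *seg, size_t seg_size, size_t off,
			  const void *rec, size_t len)
{
	assert(off <= seg_size && len <= seg_size - off);
	memcpy((char *) seg + off, rec, len);
}

/* Make bytes [from, to) durable; msync() needs a page-aligned start address. */
static int
flush_range(void *seg, size_t from, size_t to)
{
	size_t		page = (size_t) sysconf(_SC_PAGESIZE);
	size_t		start = from - (from % page);

	assert(from <= to);
	return msync((char *) seg + start, to - start, MS_SYNC);
}
```

Without DAX, the first store to each page of such a mapping goes through the page cache and can fault, which is where the page fault hypothesis comes from.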

I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
approach here. But because it's slower without DAX, we need to keep the
current code for non-DAX systems. Unfortunately it means that we need to
maintain both implementations, selectable with a GUC or some DAX
detection magic. The question then is whether the code complexity is
worth the performance gain on DAX-enabled systems.

Andres was not excited about mmapping the WAL segments because of
performance reasons. I'm not sure how much of his critique applies if we
keep supporting both methods and only use mmap() if so configured.

- Heikki

#40Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Heikki Linnakangas (#39)
Re: [PoC] Non-volatile WAL buffer

On 11/26/20 9:59 PM, Heikki Linnakangas wrote:

On 26/11/2020 21:27, Tomas Vondra wrote:

Hi,

Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.

The numbers (5-minute pgbench runs on scale 500) look like this:

          master/btt    master/dax           ntt        simple
    -----------------------------------------------------------
      1         5469          7402          7977          6746
     16        48222         80869        107025         82343
     32        73974        158189        214718        158348
     64        85921        154540        225715        164248
     96       150602        221159        237008        217253

A chart illustrating these results is attached. The four columns are
showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
"ntt" is the patch submitted to this thread, and "simple" is the patch
I've hacked together.

As expected, the BTT case performs poorly (compared to the rest).

The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch does
outperform these cases by ~20-40% in some cases.

The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.

The page faults are only a problem when mmap() is used *without* DAX.

Takashi tried a patch earlier to mmap() WAL segments and insert WAL to
them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at
/messages/by-id/000001d5dff4$995ed180$cc1c7480$@hco.ntt.co.jp_1.
Could you test that patch too, please? Using your nomenclature, that
patch skips wal_buffers and does:

  clients -> wal segments (PMEM DAX)

He got good results with that with DAX, but otherwise it performed
worse. And then we discussed why that might be, and the page fault
hypothesis was brought up.

D'oh, I haven't noticed there's a patch doing that. This thread has so
many different patches - which is good, but a bit confusing.

I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
approach here. But because it's slower without DAX, we need to keep the
current code for non-DAX systems. Unfortunately it means that we need to
maintain both implementations, selectable with a GUC or some DAX
detection magic. The question then is whether the code complexity is
worth the performance gain on DAX-enabled systems.

Sure, I can give it a spin. The question is whether it applies to
current master, or whether some sort of rebase is needed. I'll try.

Andres was not excited about mmapping the WAL segments because of
performance reasons. I'm not sure how much of his critique applies if we
keep supporting both methods and only use mmap() if so configured.

Yeah. I don't think we can just discard the current approach. There are
so many OS variants that even if Linux is happy, one of the other
critters won't be.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#41Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#40)
Re: [PoC] Non-volatile WAL buffer

On 11/26/20 10:19 PM, Tomas Vondra wrote:

On 11/26/20 9:59 PM, Heikki Linnakangas wrote:

On 26/11/2020 21:27, Tomas Vondra wrote:

Hi,

Here's the "simple patch" that I'm currently experimenting with. It
essentially replaces open/close/write/fsync with pmem calls
(map/unmap/memcpy/persist variants), and it's by no means committable.
But it works well enough for experiments / measurements, etc.

The numbers (5-minute pgbench runs on scale 500) look like this:

          master/btt    master/dax           ntt        simple
    -----------------------------------------------------------
      1         5469          7402          7977          6746
     16        48222         80869        107025         82343
     32        73974        158189        214718        158348
     64        85921        154540        225715        164248
     96       150602        221159        237008        217253

A chart illustrating these results is attached. The four columns are
showing unpatched master with WAL on a pmem device, in BTT or DAX modes,
"ntt" is the patch submitted to this thread, and "simple" is the patch
I've hacked together.

As expected, the BTT case performs poorly (compared to the rest).

The "master/dax" and "simple" perform about the same. There are some
differences, but those may be attributed to noise. The NTT patch does
outperform these cases by ~20-40% in some cases.

The question is why. I recall suggestions this is due to page faults
when writing data into the WAL, but I did experiment with various
settings that I think should prevent that (e.g. disabling WAL reuse
and/or disabling zeroing the segments) but that made no measurable
difference.

The page faults are only a problem when mmap() is used *without* DAX.

Takashi tried a patch earlier to mmap() WAL segments and insert WAL to
them directly. See 0002-Use-WAL-segments-as-WAL-buffers.patch at
/messages/by-id/000001d5dff4$995ed180$cc1c7480$@hco.ntt.co.jp_1.
Could you test that patch too, please? Using your nomenclature, that
patch skips wal_buffers and does:

  clients -> wal segments (PMEM DAX)

He got good results with that with DAX, but otherwise it performed
worse. And then we discussed why that might be, and the page fault
hypothesis was brought up.

D'oh, I haven't noticed there's a patch doing that. This thread has so
many different patches - which is good, but a bit confusing.

I think 0002-Use-WAL-segments-as-WAL-buffers.patch is the most promising
approach here. But because it's slower without DAX, we need to keep the
current code for non-DAX systems. Unfortunately it means that we need to
maintain both implementations, selectable with a GUC or some DAX
detection magic. The question then is whether the code complexity is
worth the performance gain on DAX-enabled systems.

Sure, I can give it a spin. The question is whether it applies to
current master, or whether some sort of rebase is needed. I'll try.

Unfortunately, that patch seems to fail for me :-(

The patches seem to be for PG12, so I applied them on REL_12_STABLE (all
the parts 0001-0005) and then I did this:

LIBS="-lpmem" ./configure --prefix=/home/tomas/pg-12-pmem --enable-debug
make -s install

initdb -X /opt/pmemdax/benchmarks/wal -D /opt/nvme/benchmarks/data

pg_ctl -D /opt/nvme/benchmarks/data/ -l pg.log start

createdb test
pgbench -i -s 500 test

which however fails after just about 70k rows generated (PQputline
failed), and the pg.log says this:

PANIC: could not open or mmap file
"pg_wal/000000010000000000000006": No such file or directory
CONTEXT: COPY pgbench_accounts, line 721000
STATEMENT: copy pgbench_accounts from stdin

Takashi-san, can you check and provide a fixed version? Ideally, I'll
take a look too, but I'm not familiar with this patch so it may take
more time.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#42Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#41)
1 attachment(s)
Re: [PoC] Non-volatile WAL buffer

On 11/27/20 1:02 AM, Tomas Vondra wrote:

Unfortunately, that patch seems to fail for me :-(

The patches seem to be for PG12, so I applied them on REL_12_STABLE (all
the parts 0001-0005) and then I did this:

LIBS="-lpmem" ./configure --prefix=/home/tomas/pg-12-pmem --enable-debug
make -s install

initdb -X /opt/pmemdax/benchmarks/wal -D /opt/nvme/benchmarks/data

pg_ctl -D /opt/nvme/benchmarks/data/ -l pg.log start

createdb test
pgbench -i -s 500 test

which however fails after just about 70k rows generated (PQputline
failed), and the pg.log says this:

PANIC: could not open or mmap file
"pg_wal/000000010000000000000006": No such file or directory
CONTEXT: COPY pgbench_accounts, line 721000
STATEMENT: copy pgbench_accounts from stdin

Takashi-san, can you check and provide a fixed version? Ideally, I'll
take a look too, but I'm not familiar with this patch so it may take
more time.

I did try to get this working today, unsuccessfully. I did manage to
apply the 0002 part separately on REL_12_0 (there's one trivial rejected
chunk), but I still get the same failure. In fact, when built with
assertions, I can't even get initdb to pass :-(

I do get this:

TRAP: FailedAssertion("!(page->xlp_pageaddr == ptr - (ptr % 8192))",
File: "xlog.c", Line: 1813)

The values involved here are

xlp_pageaddr = 16777216
ptr = 20971520

so the page seems to be at the very beginning of the second WAL segment,
but the pointer is somewhere later. A full backtrace is attached.
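
For reference, the arithmetic behind that assertion can be spelled out as below. This is just a sketch of the check with the two values plugged in, assuming the default 16MB segment size; the helper names are made up for illustration and are not from xlog.c.

```c
#include <assert.h>
#include <stdint.h>

#define XLOG_BLCKSZ 8192		/* WAL page size used in the assertion */

/* Start address of the WAL page containing ptr: ptr - (ptr % XLOG_BLCKSZ). */
static uint64_t
expected_pageaddr(uint64_t ptr)
{
	return ptr - (ptr % XLOG_BLCKSZ);
}

/* Byte offset of ptr within its segment, for a given segment size. */
static uint64_t
segment_offset(uint64_t ptr, uint64_t seg_size)
{
	assert(seg_size > 0);
	return ptr % seg_size;
}
```

With ptr = 20971520 the expected page address is 20971520 itself (the pointer is page-aligned), while the page header says 16777216. With 16MB segments those are offsets 4194304 and 0 within the second segment, so the header found at the 4MB mark still carries the address of the segment's first page.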

I'll continue investigating this, but the xlog code is not particularly
easy to understand in general, so it may take time.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

initdb-crash.txttext/plain; charset=UTF-8; name=initdb-crash.txtDownload
#43Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#42)
3 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

branch                    1        16        32        64        96
-------------------------------------------------------------------
master                 7291     87704    165310    150437    224186
ntt                    7912    106095    213206    212410    237819
simple-no-buffers      7654     96544    115416     95828    103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad, not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what the root cause is, but I have a couple of hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate it.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.
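
Roughly, the accounting works like the sketch below. This is a simplified standalone illustration, not the code from the patch: the type and function names (BackendStats, stats_add, now_ns) and the configurable log_every field are assumptions made for the example, and the log format only approximates the real one.

```c
#define _POSIX_C_SOURCE 200809L	/* for clock_gettime() */

#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Operations tracked, one per line of the log output. */
typedef enum
{
	OP_MAP, OP_UNMAP, OP_MEMCPY, OP_MEMSET, OP_PERSIST, OP_COUNT
} OpKind;

typedef struct
{
	uint64_t	cnt;			/* number of calls */
	uint64_t	time_ns;		/* accumulated elapsed time */
	uint64_t	len;			/* accumulated bytes, where applicable */
} OpStats;

typedef struct
{
	uint64_t	total;			/* all operations seen by this backend */
	uint64_t	log_every;		/* e.g. 1000000 to log every 1M ops */
	OpStats		ops[OP_COUNT];
} BackendStats;

/* Monotonic timestamp in nanoseconds, taken before and after each call. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * UINT64_C(1000000000) + (uint64_t) ts.tv_nsec;
}

/* Account for one finished operation; returns 1 if the stats were logged. */
static int
stats_add(BackendStats *st, OpKind kind, uint64_t elapsed_ns, uint64_t len)
{
	static const char *const names[OP_COUNT] =
		{"map", "unmap", "memcpy", "memset", "persist"};

	assert(kind < OP_COUNT && st->log_every > 0);
	st->ops[kind].cnt++;
	st->ops[kind].time_ns += elapsed_ns;
	st->ops[kind].len += len;

	if (++st->total % st->log_every != 0)
		return 0;

	fprintf(stderr, "xlog stats cnt %" PRIu64 "\n", st->total);
	for (int i = 0; i < OP_COUNT; i++)
		fprintf(stderr, "%s cnt %" PRIu64 " time %" PRIu64 " len %" PRIu64 "\n",
				names[i], st->ops[i].cnt, st->ops[i].time_ns, st->ops[i].len);
	return 1;
}
```

A caller would wrap each pmem call as `t0 = now_ns(); ...; stats_add(&st, OP_MEMCPY, now_ns() - t0, nbytes);`.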

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~9ms in total. There were ~14k pmem_persist calls,
taking ~10ms in total. Most of the time (~1.5s) was spent in pmem_memcpy,
copying about 15MB of data. That's quite a lot :-(
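
Breaking those counters down per call makes the picture clearer. The helpers below are only a convenience sketch for that arithmetic (the names are made up), taking the logged numbers at face value:

```c
#include <assert.h>
#include <stdint.h>

/* Average elapsed nanoseconds per call. */
static double
avg_ns_per_call(uint64_t time_ns, uint64_t cnt)
{
	assert(cnt > 0);
	return (double) time_ns / (double) cnt;
}

/* Average number of bytes handled per call. */
static double
avg_bytes_per_call(uint64_t len, uint64_t cnt)
{
	assert(cnt > 0);
	return (double) len / (double) cnt;
}

/* Convert nanoseconds to milliseconds. */
static double
ns_to_ms(uint64_t ns)
{
	return (double) ns / 1e6;
}
```

For the memcpy line this gives about 1.57 microseconds and roughly 15 bytes per call; for persist it is about 0.75 microseconds per call, and map plus unmap add up to roughly 9.2ms.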

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch                    1        16        32        64        96
-------------------------------------------------------------------
master                 7291     87704    165310    150437    224186
ntt                    7912    106095    213206    212410    237819
simple-no-buffers      7654     96544    115416     95828    103065
with-wal-buffers       7477     95454    181702    140167    214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look like this:

branch                    1        16        32        64        96
-------------------------------------------------------------------
master                 6635     88524    171106    163387    245307
ntt                    7909    106826    217364    223338    242042
simple-no-buffers      7871    101575    199403    188074    224716
with-wal-buffers       7643    101056    206911    223860    261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both the "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.
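
To get a feel for how much the segment size changes the number of map/unmap cycles, here is a small sketch of the arithmetic. The helper name is made up, and the 96GB volume is used only as a convenient reference point for comparison, not as a measured amount of WAL.

```c
#include <assert.h>
#include <stdint.h>

/* Number of segments (map/unmap pairs per backend) needed to cover wal_bytes. */
static uint64_t
segments_needed(uint64_t wal_bytes, uint64_t seg_size)
{
	assert(seg_size > 0);
	return (wal_bytes + seg_size - 1) / seg_size;	/* round up */
}
```

Covering 96GB of WAL takes 6144 segments at 16MB but only 96 at 1GB, and in the "simple" patches each backend that writes has to map each segment for itself, while the NTT patch maps its single file once.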

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments and is useful for common instances (not everyone
wants to run with 1GB WAL segments)? Or do we need to adopt the design
with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I'll continue investigating this, but I'd welcome some feedback and
thoughts about this.

Attached are:

* patches.tgz - all three patches discussed here, rebased to master

* bench.tgz - benchmarking scripts / config files I used

* pmem.pdf - charts illustrating results between the patches, and also
showing the impact of the increased WAL segments

regards

[1]: /messages/by-id/000001d5dff4$995ed180$cc1c7480$@hco.ntt.co.jp_1

[2]: https://arxiv.org/pdf/2005.07658.pdf (Lessons learned from the early
performance evaluation of Intel Optane DC Persistent Memory in DBMS)

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

bench.tgzapplication/x-compressed-tar; name=bench.tgzDownload
patches.tgzapplication/x-compressed-tar; name=patches.tgzDownload
��<ks�F��J��9o]L� EJ���"od�rT�WIt�u9�$V�E1����c@��nrUw��H��������8�ag��35O;7�����Lt'tu*�n�����?��������g����������p��w��]�����;�������\�n"�wzo�\���c2���O���og�z2���'��I7�n���i����l��X>��e����v�������7R�����:��K��j��/Z���D���9m�{(~�iK����`>��U�t:M���^m	��G/]��;{��.��L`�52t�Y�)�CYi�xQ`�s������������(	\m,'B���K��K5=K��n�n��Z$���� ��cD�����i�mq��R�����w�W9�;�� ������f�+F�T
�<�T�h�~�K0�@	o�F0��c�~�j"\�J@4�h,.��>��r�R�E$a�T1��
����������c?�^*��+^�F:s������1��t���K���4�'FR,?Me$@c�������D10��h}�	1���BFz�H���W������'
YG�w-$$���b�%�<`�<����x����:����&P��C��@`�O�����<]:,���]�Y���^�b�>J��h*Cr�e����B��.s
{����:_*�s�-\�.F�i'�m�H�<w5�_�E`3��A,��Z����|�������t�������Y6�D�p|�jq��j�����������������]���F*�I�6M=@q���5��_��}-��`&�G7���
� �����5h+�y|�������r����������`z�.���xT���O�Y����V�	5H@%����f�Pq���K[AtA���np��
�0%(P$}d}���V@���C��h��t�����GD��T"K�`��
�����RA��x���������0N��'h8���>��D���d@ACPX`���e���n��HK� 5���hE��vA�A4�w��X�A�c��{�7i!����
��Q��T�<��2�9�
������i~?'I�t4���]��� 
 ����g��|��(������5��l����^�����Qo�����M�o��S�Ei� �����'�����|'��W��^9���x3�Q��x:����n��	+�*��e�8�� }
���2�d ��
���	��G���c(@����2�i��I��"�	��*�eL���"����X�!��� C�9�TY�������pA�?Z���A��J�D���������K��	�(r��'�Gj��8S���p�����s�Zz��xL�lAv.C6CA�)0-�	�y���R��E��5�un,�y��'���M�=��#��gtGt��~��^:��U���
b�bnIDN��U�������~�n[0���#�Zr����Zw���]t1��~�y����d�G�
@$�[���(�a��#y�~e������R��1��Ue�|G���	r�=�K�����Ayd�A���@�%.�.��������vx�����������j|�{��7�����~��W��?��
�-�he7���z����j2*���^��m+���4(��x��Ycyz����$"+f3/6���o��z��A�����!�b�nbk��{"���.�]�'�P�O������lH�$���47J?�88i�w��D_�V��������-��~C��^9�tCTv��j���4!���9�9+���nsV[i�j3�_�0���Z�0�
����*$�[�@5V�f4j��Nc�+'���5J���<����Pv}�z���5��7���l�=��z�Fy��Ew�9��������&���E,�sl��F��xO��M��B�Z�c�x��Ap��o�u�<��WG<��k�*Vqq�m�"u����N]s��y��y��������o>�o>��	t
�3���z}w��M�����{�6��o�}���`c�a�k��_@�������\�?~4�Z��a�G�>�)�v�%jx���;�{��������9�A�:�9��(���&Io+��m�l��n�8����I���Di��he�0N���9�7��%r��6�_ ��1b�]��1ga58�m�)1���\w�
�
�7q�'���f������"X��&2�%��������@?�@{K�x"����� ��<���������;�]����G��<�l&�r�?��\���=wr�t�$p��Ux���R����ao�;Vt��ok~�(a�8�=����B`��������#��������<�����(����)`O���c������V%�\���\���Y`���4R���J�6���?���f��������*`��11��=������c�4�c7�9[�a�	a���p�Hi�<t���	����Af����������O�45y:%p�mb������
�9K�j�[�c�9��%�����!k��	�qo6�x-!����
|G����, <{{��-��x�)�"��V�O���r�#��q�r���)�s����;P
�<}t!p���?7��M�u����LAD�!�tZ�"��[�jD��D%CO%	H�UT2nVHU��u\����g��2�Z{�-�o_Y�;�Z�b4�V����:�<�&�a���ZS�	��\UU� &8���[�����J�fX�����{��G�q�t	���h�m���R/P��|4V��k������`�r�!��wZ����h�5���2���������J��J�`P���������W�����t|{�h	��L�K3�{�@���)�I!+����S�o1�v�6�1���r�l�$���iE����g���YkSR
L���c�5��8���,"�F�\�c�)��5��9�"|m�gd������	Sh�e�b����+A�TP]C�7{���72Qt�A��^XN|�T��_$&'������������-��}�j��i��:��������7����?���.���iD��8�A|S�1
5a<$��	YL�(;���f};��~�$�q>�2<�c�"t��lg��pJK8I�^j4p��E�8������|��o�/?���;�lW{����s�
;�@�\�}\i.)k�(+�>{�����7lK��6���*�N�Y��.���X&�H���8G�H%�y��`�C���aA8ep;^�G%~?����������~xvs=��pvs��4S!*���;�{r�'}{��������YNJ`e���C��^`���� �X�A��f2�x%I�
�}�P�8�L���f�{tD
�pw�A�|��SU�R��M����$n����D�6��g5C��&��V1�a�P�6h�(�u:!�jE'�U	�`����q�JGa":�.���Sb��}g���fX#X��}���`��$�72YT��0Ex��n@��h5sR!�5�X�]xA�M\�����'�������\bZb��R&+����z��m��se�@NJ5�P��o0�	2�|�d��r���i�Z��FI����(&T]��D�?q������ �5��,'b�1�x��4�H>t�:U��� x������E^h�����a<�:	3�L�F($�t�z�Ie�r�~,i�'�
y��.�ej�����0�G�;@@h�M�0��_�x���cv�.���0��2���VA�S85��]�D�o�w���E�����Y%��K55���V�DP���x�2��=U�:��-D���,;��a�o��4��BA�
:�08c&�j����t`U��j�J�U#��1E��)�����	��k��MejUZ�`��E.#!�"fx>�(G�U�!��N�x~��+�\���)	�bv,0��r���A��0�<S�'�
���+�����P����/u������u���gW�?�����}�9��u`��_y��L��m�,mt����#����9�>Z���8��9���i�m���q��M��yO��.R�k���rWK�F19�gB���$�������P�f�jBiT�~a�X��0�
L�������0*V�	5O�PZP�j]p�LD}��L�7oL�b35�@�Dm?�DpM�m�_���y����X�uA&�f*�hsG�*���h������h/�e�����*���*������d�+�@dK���j��8��L$���������XVN	�ig������mjMY�S�������4�9�a������T�	+!��J�6o
S����N�Vj�^��@����vM����Z��.�	R�0��T�����
�"�:/
'ds�V$"/Oe9 8X)�bZ�x�9�n�j��t���J��{��rx���"�-�,�����-9mI�������rN����5��p|���t�I�k�y'#	�s
-A��(QG��s���~E�:-I�������C��Y�l^��Ho��c*�[�j�Y-Y(�	���/� �z��/����PQ������EA)�G$�*�
PM�����-t����oA�E��;����c��\��,�6�Q��]�b�j�NM&� w]q�������f�8�Y��KL\�������gkS�Z����F�\H�����O���yy��T��%y�)��nU����>��&�X��6�3�Jb�r\Y3]P4tC���4�E��(�wZ-N���%����
o��SZ���k��"��c��'
_�/���r�����tIe��V2f�r3���$$���A7{�6Bg�z�bds���z�B����7�,�����q����Xdk���2!Oj��������LO�t(�a2s�f�Z"g���������&���L�IT@�ke(��1����
/�I��3�5�K�3��H�h�\��L��]A)���b�<�����J�nD^�V�Xn���,�	���X���7�
7�b���R����������Y����T����/����;������;��km<�oj�E~�k�c��N$��*9'��$���v^1X��#��
��V���������eg��Y�������;�I��#d��~B�Y���)j9�a�R�;�����#V�A�nASK
7���I��w>�?5M��������K����1q(V
�c��Z�,���
xG�tCa+�c{/M�o�����+�����"�r��!��A0�#~e4Wl�Iqg�|/��j&��������P�����Z��0%23��b�R��L0/6����	c�	Dr/d��\�c�-��q�����%��u�Do$1�w_3��n��~��7�3BZ\>���1���3M��E�����MBG�]������D����j?���W�Wq���w��T�7l��L��6'Q��
y��]�P��Z8�s��h ��/�����T�F�% �+w��4�#���K��?���Y-0}���#]4�]I�)P��t��p8VE�L;��v��'
�&�Y��L���U��������d@	�w����(�N�B�T�`h�B0;�! Y�@�Br��j:��n��[(��_��i��X��e&�"�4A�js�-:0��^H^)��P���a��t��9���-L�x�x�sz9M-b�hgE�ANv�up�����,j'��l���nO�/���n�����P��8�,������}��`�����_��9����f���?��"���^z��+���~m����x�a?#�&����1EB����2���'��x�;!���QB��u���2�D�����R��&�w5���`�&�u��P�
�&- ���a�=~�),Wx��HM���?��wG�.�������,@DbM�K����!(K^��$VH�D����{R�<;;���XdwUu�S'>g'����Od�}�Gn��i��*G
�)�Eu� a)�"J�C�]�s�0o��7G�G�
��z��L/L >���*�_������:�Y")6���3"�LC-�>�=
E�%br�Z"A�"�j��n��E�D4F���+-�* �;t�������/�t���64���1D������F�i�(
�4]��!�9]�2q��=����!
���� YLf�h�\4L���j��3���lb;k��tW��������d��d=����,�p;�������;������$�&������
��i�����|�y*
�N���GQ�7����V���!]U�t���q�t'f����z����Y�t�K���E����%���J�4�y���%;��Z�#|���d�-r��\��XxB��>|Y�F���Qu<��:�@�+�8�{;��&�����n\m�*�$s^q��5j�.���NW��}��!}���.�W���i�����R�Yh-G�B��U��#�(V8�DoB����/Q�K���� .�������Z���h�������eU�A+�U8��n����}��_�A�(K�n����8������Z�����^5�_�/]�CI�`AeX�Ab�!,�2�g��}�d��Wx����������)+�+_�;�GW��e������(��kD���{Hr�F����L����xc��`�1p�\zf�CyK�Q��5z,�&w�Kl/y�$	.��-�p^s&dU�s �W�Ip��h��_�����4?�{���W;�QT|��Q
��$�vp�z��X������%0V��j7�\M��G�'o[�{�s.��8�pp��	�������c�e=I/�0���{$_$�<c$-=�3����=�72
=�9�f�I3O�%O��`k�0��1��!�U�7�_y;�_y?�d7
C�{9�"R�E=�(����5�(��� �n����M(�W��$^��������l�`d��)�J�0����M(9HH���$;��KL�\#h���cK����<����Y���	�����w�:�;Mt��I���n�?��"������s�i��yz3��[9��4J.i��/k=�r���K����Z����m����U�]"z��|s��I�x	.��)!t1q2)�G��lJ�J������;���.j�"�q�z�%�~=�7�K������p��������FL����N��U��Q���}�S�M�W'��6Pig9��/��L\HDovNsa'���#8�5/|x�������o�N��p��[0p0��r��}�A�g��L����k�h��GR����9o}>�}Lwyr���}p�1��:��j|�?z�<����:��8!�(�PX�,
���%��q&�(���� q)��_9D?`��S�%d�M�$��������{��n�����b�<��(��}�t��w�p��Y��j��CVB�!�=�01��
�vL�9%���Z9����+
�[.Qh��!�6]jG�r��f�W~C$uG��N��v4�D���f�IV��XB��r2���8=�Nl��@���P6���s	��I/�����������\d!�����y
znAbtl"��'|�\:,Y����������y��n���M���bF�1��l\�����!�Z�,{�\�6�b�G����m�
 �^��n�_�e��lH=R�����%`���E!Xp�MS�&��J�iq}�����iw2�K�|1;�za��iq6���I[�?��05l��'<5B����Fbue�[|dM��
� �'�{���E}��r2dY���������L��y���'�t^(��V%���wb1��u6	��g
"�?�{����i���y����|�����7�������F�����G�8������+~��v��!����j2��@ ���]j��#���H�-�([W�~���7����F62�k�+��?e������9��l�?��_#7��xh����aD�	o�]�����
�R��D��X���l��F.����8�
?�������4�s��&E����lJ�Dw�������4n_k���N��
�Y�MD\A����4������j�����C�3�������X^�����J�Hx��e��7ooz������Z��/���4�����M�K��$(�	�'w���N��O��k�I��77����}��C�8&J��E�8��Z[��1"��l��`d��s�A`a.c�hK������ebU�����h-������W��&`�4��}p�
7@Y��Z�
8�����������Wh��I�����T/�~�}���br��|�d����0��M�����i���-N �����y����h���1�q�/�������%n�66d���PP�PW���if�=zb&����4Q���Q�����]������;���p����l�;������"�����������SR�����y6V��v/
QYI����_���h�$����f^����}�R]���T���tw�'���}�g���1���&}��8����e;}�Zz>���B-������7�o��-�����
�N�~�Zi_mm��Ris{�����jw�<Ur�������r��-G��6�}!d-\���V7����F�����7h1<�#�u���hn�&��M�Iv��OXr�M�ns��x:�����1�!L�`tv��k����UH:)�����0�|���q����<.pJ�����w�L��s�+>���0�t�I* I��b��t�?��y�B�I;����+��hv��Z3G���<����F.5W�SFK@����~S����W�G�dt�����f�tZM���%��y��t�����
�R��h!lB9=�V'
l&!1OG\�@*�k����/�M�E�I&��������b�'��f<7.��v���J�J5���z�T7��8!J,��$�1����"�Fe!du�*������*�{8{��?=�8?=�V��J�\��Er��|n}�?wx[\S<R8�����������_�k����7ZG����ZUHl�Il�P���k�(�P|#�I��U`G��^���Y]i}�$5�����fS}�EuQ��U�@���l�����7����!.uN�f�C;�jsSN6(���6�4Y�u�����g�5���I��-I�����v�G:y8�v��<�����7Dx��1R.P6S�i��)3	^�������e��okt�-���T���|3W�)���}q3���I��3�q������c���n�����1����9����
64�2 ������.�!�E�����d�k���B=�h�t6��/����&�����7tq��������d)�A����������;'��4�IV+EiaJ)!������N��]r�Q��oW��	�����i%�(R{����i���nq��?��l!�S���	��A��cI%�U[+i�����(XH*�G���e��5YM���EZ��;^��IL����o�L����Gs^j�A�u�U�
�k�R�z����jI���F�;!R��O�U�\e�x����f�n4�yq���=x�@�'��l��X�����u~�w�I��"�� :�X"��x�8f�F7���V:+&AZ����M��yc�����!���-�V���I�}�����6?�#�o�F���I��;n��X��8�;{wz�h&L�����0+E��)2�N����t��
�;�f�H{?���&�Ew�g��0o73E���Q2���}�������N�)����e��
H]�>v[�ob�-�`e$Q��s&�5a��}��*l�D�������p�qE<��f�%�+�67^Pog�&�T��$o��-����;|����O��'��3�MI��	-Z�@�����2�V�����46�%I��k�������%T�_������v�r{�[����[�*��)X�Y{4����@�f���������KL�f)(;���O�����3y�P���#`N��VC`��\���md�(Y�*��a0)��y��w���H�����k(�Z�6I8����W��
�>�&�a�z)�z�� ���n���W���|R��}�V��R������7����E�}����p$�Z�(��MF�
��h%h������M��.�����?�Sa�����k2:*�g�g��c�S�R/�f�����R,����W�<Bq����y�<��,��9������������jq�
�[��l`�%d�%�f���u���s5������R�������:�G?J��QIL.��P�u���� xmwj���������b
IN)��:x8��y�5G��LS�B�kH4���ON�4�X����l��G��d6,NGE�vW������'���������<�g��vp7�����z�����?�������?�������?���s����8���
PMEM.pdf (application/pdf)
#44Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#43)
Re: [PoC] Non-volatile WAL buffer

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

branch / clients           1      16      32      64      96
------------------------------------------------------------
master                  7291   87704  165310  150437  224186
ntt                     7912  106095  213206  212410  237819
simple-no-buffers       7654   96544  115416   95828  103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what's the root cause, but I have a couple hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate it.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(
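
As a rough cross-check of the figures above, the totals and per-call averages can be derived from the counters. The sketch below is not taken from any of the patches: the struct and function names are hypothetical, and it assumes each stats line reports a call count, accumulated nanoseconds, and accumulated bytes.

```c
#include <stdint.h>

/* One counter line from the xlog stats excerpt (hypothetical layout). */
typedef struct XLogOpStats
{
    uint64_t cnt;      /* number of calls */
    uint64_t time_ns;  /* accumulated time in nanoseconds */
    uint64_t len;      /* accumulated length in bytes, 0 if not tracked */
} XLogOpStats;

/* Total time spent in the operation, in milliseconds. */
static double
total_ms(XLogOpStats s)
{
    return (double) s.time_ns / 1e6;
}

/* Average time of a single call, in nanoseconds (0 if never called). */
static double
avg_ns_per_call(XLogOpStats s)
{
    return s.cnt ? (double) s.time_ns / (double) s.cnt : 0.0;
}

/* Sample values copied from the memcpy and persist lines above. */
static const XLogOpStats sample_memcpy  = {985964, 1550442272ULL, 15150499};
static const XLogOpStats sample_persist = {13836, 10369617ULL, 16292182};
```

With these values, total_ms(sample_memcpy) comes to about 1550 ms and total_ms(sample_persist) to about 10 ms, matching the ~1.5s and 10ms quoted above. The memcpy average works out to roughly 1.5 microseconds per call.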

It might also be interesting to see how much time is spent in each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch / clients           1      16      32      64      96
------------------------------------------------------------
master                  7291   87704  165310  150437  224186
ntt                     7912  106095  213206  212410  237819
simple-no-buffers       7654   96544  115416   95828  103065
with-wal-buffers        7477   95454  181702  140167  214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.
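
To put numbers on "somewhat better than master (except for 64/96 clients)", the relative change per client count can be computed from the table. This is only an illustrative sketch: the helper name is hypothetical, and it assumes the table values are pgbench throughput, with higher being better.

```c
/* Relative change of a patched result against a baseline, in percent. */
static double
relative_change_pct(double baseline, double patched)
{
    return (patched - baseline) / baseline * 100.0;
}

/* Results from the 16MB-segment table above, for 1/16/32/64/96 clients. */
static const double master_results[5] =
    {7291, 87704, 165310, 150437, 224186};
static const double with_wal_buffers_results[5] =
    {7477, 95454, 181702, 140167, 214715};
```

For 32 clients this gives a gain of a bit under 10% over master. For 64 and 96 clients the result is negative, which is the regression mentioned above.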

At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

While looking at the two methods, NTT and simple-no-buffers, I realized
that in XLogFlush() the NTT patch flushes WAL (by pmem_flush() and
pmem_drain()) without acquiring WALWriteLock, whereas the
simple-no-buffers patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also contributed to the performance
difference between the two methods, since WALWriteLock serializes
the operations. With PMEM, can multiple backends flush records
concurrently as long as their memory regions do not overlap? If so,
flushing WAL without WALWriteLock would be a big benefit.

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look like this:

branch / clients           1      16      32      64      96
------------------------------------------------------------
master                  6635   88524  171106  163387  245307
ntt                     7909  106826  217364  223338  242042
simple-no-buffers       7871  101575  199403  188074  224716
with-wal-buffers        7643  101056  206911  223860  261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments, to make this useful for common instances (not
everyone wants to run with 1GB WAL segments)? Or do we need to adopt the
design with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I think the performance improvement from the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if we
haven't done that yet.

Also, I'm interested in why the throughput of the NTT patch saturated at
32 clients, which is earlier than master's (96 clients). How
many CPU cores are there on the machine you used?

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit all of its abilities, we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

Regards,

--
Masahiko Sawada
EnterpriseDB: https://www.enterprisedb.com/

#45Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Masahiko Sawada (#44)
Re: [PoC] Non-volatile WAL buffer

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

branch / clients           1      16      32      64      96
------------------------------------------------------------
master                  7291   87704  165310  150437  224186
ntt                     7912  106095  213206  212410  237819
simple-no-buffers       7654   96544  115416   95828  103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what's the root cause, but I have a couple hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate it.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(

It might also be interesting to see how much time is spent in each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().

Yeah, we could extend it to that, that's a fairly mechanical thing. But
maybe that could be visible in a regular perf profile. Also, I suppose
most of the time will be used by the pmem calls, shown in the stats.

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch / clients           1      16      32      64      96
------------------------------------------------------------
master                  7291   87704  165310  150437  224186
ntt                     7912  106095  213206  212410  237819
simple-no-buffers       7654   96544  115416   95828  103065
with-wal-buffers        7477   95454  181702  140167  214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

While looking at the two methods, NTT and simple-no-buffers, I realized
that in XLogFlush() the NTT patch flushes WAL (by pmem_flush() and
pmem_drain()) without acquiring WALWriteLock, whereas the
simple-no-buffers patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also contributed to the performance
difference between the two methods, since WALWriteLock serializes
the operations. With PMEM, can multiple backends flush records
concurrently as long as their memory regions do not overlap? If so,
flushing WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

TBH I'm not convinced the "simple-no-buffers" code (coming
from the 0002 patch) is actually correct. Essentially, consider a
backend that needs to do a flush, but does not have a segment mapped. So it
maps it and calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look like this:

branch / clients           1      16      32      64      96
------------------------------------------------------------
master                  6635   88524  171106  163387  245307
ntt                     7909  106826  217364  223338  242042
simple-no-buffers       7871  101575  199403  188074  224716
with-wal-buffers        7643  101056  206911  223860  261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments, to make this useful for common instances (not
everyone wants to run with 1GB WAL segments)? Or do we need to adopt the
design with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I think the performance improvement from the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if we
haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

Also, I'm interested in why the throughput of the NTT patch saturated at
32 clients, which is earlier than master's (96 clients). How
many CPU cores are there on the machine you used?

From what I know, this is somewhat expected for PMEM devices, for a
bunch of reasons:

1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
it takes fewer processes to saturate it.

2) Internally, the PMEM has a 256B buffer for writes, used for combining
etc. With too many processes sending writes, the access pattern starts to
look more random, which is harmful for throughput.

When combined, this means the performance starts dropping at a certain
number of threads, and the optimal number of threads is rather low
(something like 5-10). This is very different behavior compared to DRAM.

There's a nice overview and measurements in this paper:

Building blocks for persistent memory / How to get the most out of your
new memory?
Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
Kemper

https://link.springer.com/article/10.1007/s00778-020-00622-9

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit all of its abilities, we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers with
writes directly to PMEM is not economical, and aggregating data in a DRAM
buffer is better :-(

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#46Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Tomas Vondra (#43)
2 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi,

Let me share some numbers from a few more tests. I've been experimenting
with two optimization ideas - alignment and non-temporal writes.

The first idea (alignment) is not entirely unique to PMEM - we have a
bunch of places where we align stuff to cacheline, and the same thing
does apply to PMEM. The cache lines are 64B, so I've tweaked the WAL
format to align records accordingly - the header sizes are a multiple of
64B, and the space is reserved in 64B chunks. It's a bit crude, but good
enough for experiments, I think. This means the WAL format would not be
compatible, and there's additional overhead (not sure how much).
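
To make the alignment idea concrete, here is a minimal sketch of rounding a record length up to a 64B boundary and of the padding this costs. It is not the code from the attached patches: the macro and function names are hypothetical, and the 64B cache-line size is taken from the description above.

```c
#include <stddef.h>

/* Assumed cache-line size; must be a power of two for the mask trick. */
#define WAL_CACHE_LINE_SIZE 64

/* Round len up to the next multiple of WAL_CACHE_LINE_SIZE. */
static size_t
wal_cacheline_align(size_t len)
{
    return (len + WAL_CACHE_LINE_SIZE - 1) &
           ~((size_t) WAL_CACHE_LINE_SIZE - 1);
}

/* Padding bytes wasted when len bytes are reserved in 64B chunks. */
static size_t
wal_cacheline_padding(size_t len)
{
    return wal_cacheline_align(len) - len;
}
```

For example, a 100-byte record would reserve 128 bytes, 28 of them padding. Short records pay proportionally the most, which is one source of the additional overhead mentioned above.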

The second idea is somewhat specific to PMEM - the pmem_memcpy provided
by libpmem allows specifying flags, determining whether the data should
go to CPU cache or not, whether it should be flushed, etc. So far the
code was using

pmem_memcpy(..., PMEM_F_MEM_NOFLUSH);

following the idea that caching data in CPU cache and then flushing it
in larger chunks is more efficient. I heard some recommendations to use
non-temporal writes (which should not use CPU cache), so I tested that
switching to

pmem_memcpy(..., PMEM_F_MEM_NONTEMPORAL);

The experimental patches doing these things are attached, as usual.

The results are a bit better than for the preceding patches, but only by
a couple percent. That's a bit disappointing. Attached is a PDF with
charts for the three WAL segment sizes as before.

It's possible the patches are introducing some internal bottleneck, so I
plan to focus on profiling and optimizing them next. I'd welcome some
feedback with ideas what might be wrong, of course ;-)

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

PMEM2.pdf (application/pdf)
%PDF-1.5
%����
5 0 obj
<< /Linearized 1 /L 157443 /H [ 701 140 ] /O 9 /E 105327 /N 2 /T 157147 >>
endobj
                                                                                                              
6 0 obj
<< /Type /XRef /Length 62 /Filter /FlateDecode /DecodeParms << /Columns 5 /Predictor 12 >> /W [ 1 3 1 ] /Index [ 5 17 ] /Info 16 0 R /Root 7 0 R /Size 22 /Prev 157148                /ID [<e060eacfc35f82fb33fb3d1bb88aa376><e060eacfc35f82fb33fb3d1bb88aa376>] >>
stream
x�cbd`�g`b``8	"�6��F ��L����!@���	X��W��301�O�	�������9
endstream
endobj
                                                                               
7 0 obj
<< /Pages 19 0 R /Type /Catalog >>
endobj
8 0 obj
<< /Filter /FlateDecode /S 48 /Length 63 >>
stream
x�c```f``�"O�20@��e`i8����`:�B�P���
�����>������d
g
endstream
endobj
9 0 obj
<< /Contents 12 0 R /MediaBox [ 0 0 612 792 ] /Parent 19 0 R /Resources << /ExtGState << /G3 17 0 R >> /Font << /F4 18 0 R >> /ProcSet [ /PDF /Text /ImageB /ImageC /ImageI ] /XObject << /X5 10 0 R /X6 11 0 R >> >> /StructParents 0 /Type /Page >>
endobj
10 0 obj
<< /BitsPerComponent 8 /ColorSpace /DeviceRGB /ColorTransform 0 /Filter /DCTDecode /Height 742 /Subtype /Image /Type /XObject /Width 1200 /Length 46232 >>
stream
����JFIF��C


		
%# , #&')*)-0-(0%()(��C



(((((((((((((((((((((((((((((((((((((((((((((((((((����"����]		!U16AT�����"7Qa��2DEq�����#��BR�34st�����$5CSe�&8bdr%Vu�������B4�3qr�15Q���!ABCa�2c����"RSb����#���?��G�+T�������cE�z2:5���-�]R����$���4���?S�3u9�G=[��Qm���=f�Y��isp�X�N�]m��u����IJ�����h��v��7�;6��(��C�I�JU�!N��$ih��z"���]J�����@Nc
�_z�'�������'��u��F��9@�uv�O��2f�1�A{�]��]�U������|)�}�/��[�	F�TJ�m��Q�2�Hkv9~�r"�6���>,HL��E���b9[~�%��}��&�U��Vk�i8��	�
bu�dg�������������F�n�K]}`eY��bbN[�"Bj������V�P���yg+&c9 Brp����=���iW����&�O�����R�Mo�VaU~�j����ZlJD��"L�L�"+7G5!��D�������f��c	z�j)����F�����D�YS�R���$^�����*��G��#��/��n����_Y����8JQ*�3M�Tl7g*�.�G6�&��S�e��m�;�����|V��J��vHc����a��J:���m�]oe����&Z��,1��/���%�������������^K�j�/1~�4���rRq���l(0���DK��	
4j��5}��>���5"��s���V��A�������q�.��
��W�	���P$a�+x*F����������x@���d��[�=�\����4�x����e����"��Z����]8�'����yN�������������Y3xu�[��S0u;Se��7E��;Y��+��TV��EEOY�a����t��Z4yhq^��j*��u���$��
B51�������y�j*��cr��J�S��i�k�"1�,��[���G�+4�4��NB�c�Nz�w���~�@���iI,�:cKE���&��[9���������{@�)����R�b�l����L�K�5��`�]"��Y�|���,7�����������S�X_'��,��L�d��5\�Tnu���5�h�c����B�d���W��Z5�i����u�9���3�������j�O�eX*oH����G�5U�w��%�EF��y��o�������d+*e!�C��h�>�]��������7H�d8n|G5�j]�r�=j�mq�I��}�g���vo�V����j�j��Q$b,6���}h�Db/����!j���v^��kK����:,F�#������X	X����T)��y��*Ck\��T�������Q���5��b�#'�
b�5c����|�f��^S.�����Zsb>$(0��+����G"}�����������p� ���eVy���Q���2�Z:��n���������$�^�K����R��D�V����%�~���0}
����he��V2�Eko��+��Ad��Uh��Y���������=������U@�P�"��JT�)�1.�"+\���TE�'4)x/�#!Bb]�{���US&�S���VhKI��hSY�����s���r%�����V�%��'f�+Jj64�����U�DU["j����q�H���B��������o�,����@dih���v��G5��T3V��d�=2Qa���'_��nw���� J����4��<TtYg9,�nED^T]Z�P
1-#O��t�6LI7>a"5�l4��\�nT�^S�a��)�x(�[]�s[������w�
{+�9I�r�5�+�L�k[d_�P�c|D\)P�-N���+�����c����S����-Uj��"����t�*�DX��w
�55L�%��O����8����f����b-��(�n�-�/Hu|+��(ot4df#�Q��u/���+������>jJZ3�Gf��������%U5p'�
.�Y�Q��-Rr��s�[��Mk���k�jS
�)T��\�k^��u�&r%��������1T�Y�Z�jK����������������`�]"��Y�|���,7���������
(JD�Y*\�����E�
!�2�5_d��.����]�_��M����9a�D�aE��|I8�QZ����H��	�/����Y��1fb:q������T���'���T�Yu���(�tHs���sp!6Hq��EV������*���e
�'GU�/&�s�"jTG#���Djk�R�U���L�;a�J9�w6Y����J���X��X^��!=��|��K>+�9��_��K l���Xc�,_��'r�KZ��1
/I���9����Lb�+#�#�E��3
�ns��.j"�Z�{��u	��xsR����:V,7��,�eXr��k�)V.I�"]5,eD�������ec�W�*w�.xG	S��&R����0�W�;����D�&�j}b+#]�S�g"��,����'5��u�����N*�����Xh8��
�N��MdFKCk���"��R��s��0��'�����18���������*�H�<����C����b�����������Q�d�DMh����-k�Y&��+��Ra�_�^F��S�����L�������16�[�LDEm�����ivj#Z�j""%��H��t��U1��2^&����z9�Q,���UE]w�KraZV#c7�]V+��
��jz�����l�`�5N��O�=df��LH����F�-�_��,�F��tX��i�_��������6�]����f����nio�*f�Q�(l�	��1���F��K""p" �MS
V�����[�P��dIxj�[����t��]�^#���)��h�3sq���%$�=���-�~NS�Z���L�aaF�����g�QW�eED�^��:�L���i�&,�����,��]�W+���0�v���5�cQ35����\�O��e*�t������Env����Oo��+�G�(�2���iP��s�F������5Yn�*�$����#������%L�$�w+}v���!��Z
k,����.���n��5����U�Omc��+���yp,9j�Tjs��Mm.Z�
X�n���o�5����-%1�����2����#���YSZZ��^P?|c����o�R���$B�O����'�N����l)�N��b���EE��u�<3B�����I>4H-{��Q]u���c�J�[���e**^P)5���&)�Z�Ek������T]KdTT����������
��Y���f�mc��[9]�xW���V�'V�|�F]�����z�h�(��&+
G�I�4�#�U3�T�6�]��'U�U
����
iV�t�)Xl��{Y���Z����'�����'����H�����K}�!C�IZ\�`B��mj�����P3�3�������^����1��������u�QfVne����n����	k�e9Ij��:�:jJ2���!:��g"9���;#<I���'��\�Dc#�o{Z����U���B��hR�v��	(��AG�������H��M�sue�5f��,M�["#s�{�Qm����#��b��5�����5/4�������dV�j�*��
X��	�T�����Fa���������j"�u�8�&���F�o���V%��^��_�=�rG%�vV;bDI��E�dh���DG&S�x*B,����sZ�����F������|���0��lDkZ��"'"@y*��:��IJ�G�]������P�;$�ycg�J�6���+s���h�!D�IZd�`A������U]j���W�����Tn������%��|���dsU}�}���������T�n������t_`�H�U�=��Y���sQu�1���'�?��A�G"��O�O�����K�Q�7F}���|�������t�p�����1ys�;�?���Z����/�������_�+E���+��r��r�U}g���>��|���dF*9�j�QS������+:G���&l�o��EM�:�&_7Y��l�D�b�1�	��F������u��#�p��[(�5��NB��$6+����]Y����������!��H����S������e������_���Iv�C�
�����"�[k����0��`L;R~2�U���K���t�83I��Z]{CE�*�����I�������g���S�(4�T��2��a2��t[���5jNr��zJud���\��^� M���\>j-���\��[
����=��Y��BG}Y��9�(o��L�OP�k7��R�.�U8����j�jj��s�}���.�D�5�<�N�_��,kI��������O1,��r***�u"����4�]��h82�_d��XPe�E�WGjC���Uc��W*Y.��|��K��!M�2nJ4�IhN�l�Ds`FV5�b��Z�����������%�,�2�P��)bn=�	������	<��US�9.2n����2N�!

�_O�:�Ul�����9�G=X�W/%��
�����I���k4���_>��"=���U�� �S�G�P���b@k_��X���-��E]H����`8�
�i�l�p�H�p�^J,8��)Z�����T�u�V"*]<���fL?O���%I1�	�P(�#h���!�Uvm��tu�\�{����LA���1+hZ�dHnG5�^EMJ��~R��d�`�J�dx,Hp��,�5�������/�X����K�r��f�����TD�HB�������������I������6C!Qn���l)���,W+�>N��]j��US�N���Q�����w��jekJ����?��(�����eb�S�!)F��B��b�sr9�F�9�vnn�R}e�����'��T�K���)H���S:��T�sxdU�����K^��d�����NKNK9l�e��#�V�����~JB�����om,�6D��5V�]''N�Yh�S��B�����C���K=���R���Qy-����/3��%��Mk�gA����aC���eU����-�^�N:FzvRBQ�S�0%���th��o�����I�S+0�Q������+�Z��Uj��A���S��T)�r���GD�J��������j�Uj&�]Z�xS&��(YM���V�������A�cd��9U,�����K%�x@�0}Jr�O��S��M>�XM}*+�
�j���U�?��z����T��+M�S�&��|y�D{~������1�\zd8� KV1���Q�T��]��/���j6X�*���J�J���9M���&���Q]����Qs���u_X�+��r����[��)��F7�b�>��1�{Z���S�Mh��p��Tu���d-��q���z�EF-f�!OG��*a�����K�&Y�����CW	�1�eb�)����ZI�K$9���}J�����ix�oFO����Nue��4�n�^���j���S��]�����,�
�BCI�����{���X������TC�����Xdw���x1����T�����j*j�NU��'+2���#U��?Q�����;_��OB���S��b�s��2�D}�,�[�Z�%bL�Z���������1��m�V�G�Q3xL�#87��C����Z�n<�Y�������\�k�����m�~��I d
�#
endobj
11 0 obj
<< /BitsPerComponent 8 /ColorSpace /DeviceRGB /ColorTransform 0 /Filter /DCTDecode /Height 742 /Subtype /Image /Type /XObject /Width 1200 /Length 50552 >>
stream
U��Y�/��?=�����"��c�o�)��1���?������\�+��^�E��lXO[���Q{������_
g2�����~��)�z���:)���QWW\�������V]�GZ�
���������TLE
�3�:��TUom��o�k�ZE�����������6�t�US�Grp�KH����OC�����s�o-���"�!�~t�x����iL}�(��TN�����Z�^��L=R�d��J��l���f��\��%�u:\���L2+�I���"q��J"��f5C�����V_���rz�Y�{����D���3����$��`�D������_�i�����}��*����Y���a2$'�������tT^%E?G\N�$��S4�S�u�9����Oe��������\S� 1]�wz�j�����I�dW���t�;�|����S�)c`}�e/tV��y���PMz��){����S�r�k���Fif��YU�>���J Q��=�G���(5|��������Ic���W,�5�^���Rdf�+�5���N{]��Y$�[�l����P[�M�l}�L3<�����4�C��>~ig��i���D)�Xr�f�����
�}�~�k��^S6�S��k��^S6�S����i�9#n�j;�7�6���v]S
�n���5]�T��c6�Z����Yp��Fc��{�9���i��8�f[�|�������(T@t�Mb��s>�����w#��P�&j�m�����f�o�~���Dx�a��Pj�����w}�8p���yK=����-ykN�un73�^Z���[���h��W�
&��t��r��ua�e�G>'/T6-�9��+9yL��O*��+9yL��O'o������m����L���U�uL*%�6���v]S
����j���v�e�NP��i���tf?�
��p���o��B#�n�<�Q��5�;���?�C6�Q��g��B����[
re�E�����j�����A��G�6�u�����7�y�,�g/��l���;i����3-ykN�un72��;�\S�(�O����,J�s�i������t	��������a;���\3�#q]�w�J�.���!�������o�fDb���\�oXg���Hn�G������Q�������r�n��d��
��9J f.�x~��9����?x��K�{2����u�{�~����*�{K��O*��*�{K��O%,�~;�������!�\���������\���������oV�Ug/�p������:d�c�K�|_���_��j����D����Y�G�BR,��{��+��F�r�_���+�Tjq1�R��T�7�*�]��
��(�MV��SU��cK}o�)P�$��yV��"����2���5�rvRj�N�C�Y(�����X���ka����ZjK����H�y�a���A�s�a:wZ�o��CW~::_�y3����7�2�HTa:��|�HoK9��G=S���"*yn_*RR�*t���=�Vj�Ff��I�EG%�Z]x��;�-�9��&>`�-�9��&>` 3����=�c����=�c��;�-�9��&>`�-�9��&>` 3����=�c����=�c���$�����Gal��1�al��1�	��5�[lz�k��:�H����&���_���sS��
���(��-�9��&>`�-�9��&>a��bO��=���_�c�7Nw�����������b�	��[s�L|��[s�L|���o�����~������l?p����0{d����0{d��8qL�
,�����v�z�~1�C�b�;���C��}���F����|����"�����=�c����=�c��]�=���_Fbut~S�����+�(�:q���PsU��������	��&�3j�-�9��&>`�-�9��&>ai����gZ�k�{J����:��6-
������sV���Sb��];n�U����~���=�c����=�c��?��Wa�������N��[���o���W�c����j���0{d����0{d��!?e���?���U�/����s�����U[DO�"9��al��1�al��1�E����Sg�_F"?�k������V������@g}��0{d����0{d��=^-�al��1�al��1��������0v����0	���Oe����h�-�9��&>`�-�9��&>a1���B��>��s��W������q�Eq_KW��b���^;�~�%�[s�L|��[s�L|����}*����z������w����Zt���?6
����l������l���;���/�����������
+k�<��g(&�������l������l���.!�}6�~��Ut�������4U�����t5�V���}HpLv����0v����0����?����������Fg�����e��[s�L|��[s�L|��iv��5kS�8_���N�����MrW�k������=�c����=�c�u\�Vu��������������o�Z�}�]S���P5^���=�c����=�c�������{����~O>K��c��a������0v����0�\n�D���^�Z�����W/���^*���^����" 3����=�c����=�c�n6�����l������l���h���``����;``�����r��v����0v����0��������^�~�����w���u}��1r�����mz������=�c����=�c�����������>N[<3�\W�������k�
We�0���al��1�al��1�
]xN�����~m������/e�����_&tf?�
��D�����0v����0��a?H��]�z�Z��]Z�_V���]����������_$8&;``����;``����Wb���?�y}q����?X������13j�-�9��&>`�-�9��&>ak���;*,�_�LGW^��U���m����_�V�|���b���<�~�'�[s�L|��[s�L|������*���a~��-:z�~�*��[s�L|��[s�L|���-�_���/�W����T2�����V�s3����=�c����=�c��e��Se��k�}Z������bXZ��G��5u�������������0v����0�F�@g}��0{d����0{d��Dw�[s�L|��[s�L|�4C������l������l��������z�{�uk�'����>�����������3k�<������0v����0������V~�V�����r���
���T��6o�]�j�.��D��``����;``����R���uM].���h7}'�4Y{-}����j�0��1��m=�'���l������l�����	�E�����������:��������,�=����~�z�!�1�[s�L|��[s�L|�����������?�����}�!��Wal��1�al��1�]����Qe���b:��GZ���~�m6�-Z������5|�����	>���=�c����=�c��wn�QV�����i������P5^���=�c����=�c���o��O��j���_������m:��������0v����0��.@��.�K\������~H,S���-z=Q��_�g�6�����l������l����5�;�-�9��&>`�-�9��&>` 3����=�c����=�c����|���������=�c����=�c��}��V�i���5u��w������(�5��X�6����=�c����=�c�����?����/�
��!�����al��1�al��1�
U����5��k��?H~�w����������f~L$�p�q�OO�?���0{d����0{d��<��u��Y��z����_����1���E>�WF���}Z���8&;``����;``����V?b���?�s�q����%�����������l������l������}�E���������}Z��Z��H����.��-yW��^��y�-�9��&>`�-�9��&>a�M��TN�m������T��~j�er�����VMW���l������l����F���V��WJfz�s����K�l���Z�1��(��eGL��al��1�al��1�	/����>�KW�WV��~h�W�����G_�_^��>M�al��1�al��1�	tCDw�[s�L|��[s�L|�4@g}��0{d����0{d��Dw�[s�L|���0�	��H��M�+*�s7W���n�U�j`&@��5B����3E�F�1Q[7R�"����%�\c�d0�
-V����6aK�X�"��f��N5U��h>L�lG�qv8���n����Oz�����6#";�r%��Db�*�]H�k0�c�uI�c�)�NJj
-~^�;V���,M���K�Yw�jTK[�Cs6qmK�R)�f^�$�M$��c�s�����q�S�}�"*�#�4IKZ�bKQf���#e!ID�EMQeM6����T�@5��'���MQ�G�I�:�����+
v4���z����}H���LI���1��l�F����
�F��*l'=��Ydk��ET��l���/d�HZ�\���F.1�����-~MP]Ml8l�m'1#}}$K����@�A�����r�W�
��T%!M���jDb9~��@3|���x����R�B�������,�d��5+��������m�4�g8B_1�����~W���.	���)V*q5����W������x��0g0��%�Z��,��I�Y�������0�]VKY�������� �,��Sa���M�nn�/4Mn��	�r�,�uU�����ob��6��J��C2���R����E�S���M_~��P���%ke�%�����'�������m�M��)2��4�j�����~>��Y~�
�2���F�y\=F�=��5(�m��12�w1�wj�o�����)�q�Z�#��KD����+R�V5���>��E���QQZ[�T�
���qM����s��7�y�����b�G^��4����:e������ct��E[�W�, �>���h�d�m���h�b�%!�cP��"�Cu�
�psQ���V��������������eB=2Z4���I��|=�����K��{Q{��C����;*qV��h������4���>3'!D�����M�k�Er)��,+H�T(����LJ=������zq=�Mh�����J�c
�%�_��G��W*R�VJ��N���EK.��DMZ��h`*��.dIU�R���Zd��<�����
���
�7[Z�.�n5.�}�r�����+pg�T:�v$9��4��:a��&�E�������S?�5Lo��LKF�a������	\��s����E{tl�]z��G��3Q�����R5?.��D��9�j������������be.��*F������s0#"��v�*��9���e�k*�����%JV��`f$��$8p�V��cQ,�j$;"'��e%��J�����P ����jY�K"'�����Q��2�'#�����A�Q7H�tQI���-���S��s"����J�!=?5%L���@����'��8�����������"�|���<��\o#/�3#J��ef���(����\I�ZjO"�LRe��6p��
��)u�9��Q�������G*�9�����|�Ob��`\q=�pV6LO^�������d��j�8M{�v�jkU��n�}F��2�����xsUZ�f4=��
���GH�E�"'�r��(��������V~���
�=���UQR�z(L����[�S�����'/)y�}�w������t��y��_�
��D���]��2����"��Z��Q\����k�!�g9������f$+��<S!6�i�\�F���p�W[���tDU���h�{`�N5���[�Y+c��"��+x��'��33�x�r�����^���.���<'��WJ,���Uj#R�����S��yM��2�1l�����~<xt�L�/����Er�Z�m��e�����.����+��a�(S��������-�~+��UN���e��LM\���U���CHQgh�k-3��[*/����_R[�LQ��<��g�**$ygH��tt
�������U-n���l%w��>	�:�B���cG��,x����x���d�|���sz�#��Pd���&�(q!��3�e�V�M��~�/��b��1��y���vJYc���t$DU����MH�i������B�L^���)�SU)�[����v�V�����u����R�����_N���4�h9!���w���V��Cz��0�
V�3U�KK�����
��Q;M;��+Q�)}%TO�������e]��?_���DUH:������jiY5�����*8�2p���/�k��S1I�t_%���b*�}��c5���et�$���di�xL���Dj���j��Q��t��AG�����6��q$�j��nL���c������U^$E[�����>�9�SU�O��rZ�$�DD]����^�"�_��<����!Q��l��1c���K5b���"]�~�/�]1^S��.&��S�yZ�H0�����H/����
����9Qn������2�����XN����5[;Q���MG�*v�T�]{�����?���P�����%399*=��M�K.��Y���K�+�YU5	����bH��
-^tW�V��G1.�%K�i*��l�x>�A�0�$�7I�.�el��������D[�*jD)��������z�
�P��U'>I�{���%����m����K�2�"d�0�1+��D�.����S��K[�i�%--�2�j^(S��Y��j�77[5w�qvUQd(�N�v��Y�7D�
\�+����K�4��x����	HbZ�rz,�8�i���H.j#��k>�[�����g�%/-;��>�P�%q��������J�-	�����~�y���i���a��
��r�*2��:{�g�n�%Hj��H���hx�*����������<�ue�0����J��U�~/"��G�W�����o�d�i�FGLG��#�����V�}_u�����};+��!%/-��)=��CF�?H��m��M��c�%!�h<U�5[v�JY�k���r%��K]5�, ��endstream
�f�R$V9X�'y�UO�P�����M�����6�C��ftU-������W�4���Cz1�TTGZ�_-��;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H��t���s�@��6�{�^hk��s�^Vr�����U>Z2#��9�b�Qt��t��n���5N���m������U�uL*%�4������wX��]�:7�&w�������/�.�%.r�tf?�
��s�����-��fw{���6�W�p���_���"�����AY����>O���Ir����E�t���Y��:����-�P�.���u���K�N�}��"��P�.��zT9��i��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r���gM�J�9t
9��h�������? ��g/)�^��P���"?S�.�wK?GJ�F��;m�T��H�6�M������5]�T��Z�I���m�u�������gx���}������R��(Fc��{�9�	��LB�{�gw�����m�|�
�����("/�M�����9�.�������+�n^q�YwA����������/|���r�l�^
��AQ����� ���r�7�C�����#��C����P�.��H�;zT9��h��r��#��C����P�.��H�;zT9��h��r��#��C����P�.��H���|��������s�@���Df6��4]�"nw~������wG�3p�8#o]�7�EXO����l��W<�LB�M�3��w��4m���Q�[v�r��\�O��!�������aF��aLv*���~����?����|z�Z�j�d��
�.9r��3
��<?g����!LE���gp����������zT9��i���fR�x���8��e�#��C����P�.����"��P�.��zT9��h ���r�7�C���"��P�.��zT9��h ���r�7�C���"��P�.��zT9��h ���r�z�!��
Rb>������y,�}���9J�bUrnWt�R�+d���M�"7E��.��5kE�z���:/���%�@�K5�MlDTV/��EG5~�C���&�P�3�%�7kn�j������*�S���0��r����U+�~ZjV�d{����4���Z�k���i���%�2�LH;�znI�nv�������I�S�0�c�hxzVZ��V�W�"���=������'��F���e�E&� ��79\���z/j���tE�d,`l'���F���r��)�������y\���x���]�����Y��2Qj~��*��^�����r�������1�cQ�DkZ�DD�"��ua�e�G=�/45��9��+9yL��O*��+9yL��O'm�j�1�f�I���������TKvmwA�������|_@]�J\2����Ot�3�1��m=��������&���
����	�~W���K�O�����R\���	r��i�O�9����ua�e�G=�/45��9��+9yL��O*��+9yL��O'm�j�1�f�I���������TKvmwA�������|_@]�J\2����Ot�3�1��m=��������&���
����	�~W���K�O�����R\���	r��i�O�9�����|�����s�i������N��&n�m����(����i����������v��dF)m�M�&�pv}>~(���{�Jz�q���=)����-L5~	�x����E3
��<?g����<?g��Y������EW�q3r��q)�Xr�f��y��
{~~Ak��^S6�S��k��^S6�S��n�rF���o�o&mwA������]�j�.��D�k�f��v��@:3���9����Ot��;2�w�E�������3�3E�|����:��F�������6{�G�\���v���`�x)�Xr�f��y��
{~~Ak��^S6�S��k��^S6�S��n�rF���o�o&mwA������]�j�.��D�k�f��v��@:3���9����Ot��;2�w�E�������3�3E�|����:��F�������6{�G�\���v���`�x?f�/*{.��@���|������?���9�z�a��*�� ~��o�fDk�������[v�r��\�O��!�������aF��{�Jz�q�S
_�l�:A���.QD���`�����������tg�)w�QU�L���'\H�� ����T��K�,�u��DU#���8�?�3<�U\iPEUTD���}J�O��
��y��,5K�������G,��IRy��4!V�o���O��*��%�jtg���1���\���9��s��U��*w��rV���[�?��!uu|���"o����=cM����w�h0�3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�h�O�����R\��|����:���u���K�N�}����3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�h�O�����R\��|����:���u���K�N�}��������Oe�0���O��=�T�wG�3p�8#o]�7�ETO����l��w?�C���"1Kn�nY7������D7\#�zS��C�(�p�q�OO�9ja��M��H6T���(���V���?x���?x����e.��*�����P���fW-j;>��d�fW-j;>��d�/U^)��R����e��'32X�[��|��*�]� [U�3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�h�O�����R\��|����:���u���K�N�}���7�g�*3q&f��,W�\������!����3G�4u��K,%�FXj�����1L��"�{�5�|;����E���m����m:��y���5�0�z��PQt�XKk�a�a�@`W@���9L�h�������? ��g/)�^��P��g/)�^����
S�9#l�i7�7�6���v]S
�n���5]�T��c5��o���	K�\���i���tf?�
��p���_���"�����AQ����>O���Ir#	�~W���K�=����.PTm;i��0w���O��=�T��N~�>^T�]S	�������v0��P\
??�C���"5��������-�i�d�������p�q�OO�0�u�=��=?�8�����6O �R��(�b�[������Tp����	��&js�f�c>,h�W*�ok����a�h�oEvQ�~��*M�m���V���(�|������i8q_��[��~�l�r-��O�RfKWJhCR��
�!$d�����>���Z���iY,���Z���iY3��W�l����%.r�t���L�-���5sJ��{��@"g�&CS(sS��V�$GJ�YU��W9u%���������IL�#���D��Tc���u�-��9�0ba\�������f�Q�Q|�����V��5�bjcIl�������KF���T�F��9�E_����Q5�u.�E]G���5�`F����'�
�EkQ�{/���"9u���,Ss���4��J�D�T�"]����=+q*�5<��1��*���
|6A�[������Gvl8m{��Tr�����l�Y����Y)��A-+?�J
1����z
]Z�kj-���R���e�,-�}�T��;_OK���s&����|����e���b	������z�M���u���]\Z�����,�8.X���`�Q����_�&�U:�����	�bY���Z�{����Dw��U%1�6��.R�'����������=���cQ\��^���g���&J���W������������X
���6�IgqqjO ����?��[=?
�]D��K�$D��������-�R�z�
+	��&.�p��@�Jih9��QX�#��V����z����^u�A�1���g��%GDDV5S�uKw��L�
�<n�#,���$�j��s���5^�_�\�Z2�R&��K���ZCw�s����"���W�\�Ie�K%���X���{�k����EC�f?�MJb���r,�����[IV�l8�c�j5?L������[1��U�"�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w�p�?�=G|�w
S���w��Wq,�f�:#[	���������x�4n�>���Z�z�����y�t$\��1�|��g��Jbz�8�t�V��~���M_��i��i��49C��#uV�5���������������z���bD�5O��Q��T�8��D�5O��Q��T�8��D�5O��Q��T�8��D�5O��Q��T�8��D�5O��Q��T�8��D�5O��Q��T�8��D�3��j����2m�������������3k�<�|��_S��v�6��[_�������p�$m�m&�f�f�t���aQ-Y��;����(n�����&!U1��Y�������.P�����Ns:���o=�OCOK�T��|�q�zve~����g��dgNf���?+��u%���I��e�F���WM_�*�	.��������x*<�F���|sH�;���q�;�8j����{���%k�6Z������nf���������
�����f�6n������x��,�Q�
&�����qH�;���q�;�8j�����\H�;���q�;�8j����H�;���q�;�8j����H��t�
S���w��B���������������mz��B������jtX������k��r��5N���m������U�uL*%�4����u8�]�
�����d�*�3_k6�������e�����6���gBc9�y����i�i�v������8oN����A}�l��������p�'�}?���\�uI9z,�(��b7J�����_!%�T�8��6{�G�\���v���iGp�?�=G|
S���w��x$A�T�8��5O��Q�$A�T�8��5O��Q�$A�T�8��5O��Q�$A�T�8��5O��Q�$A�T�8��5O��Q�$N~�>^T�]S
����q�;�ay��6����(n�����6�;����c������������v��dF��s���	��i�n�]�����~�*1Kn�nY7������D7\#�zS��C�(�0���b�I]���~�E��x���Z�j�d��
�.9r��3
��<?g��o2v^O~o��z=���������^����E�i^&�d�k��^���S��g�8����6,gh�n�u���=	J_~��.�M��)�\���������Q��b���]�
�����f�+���U��8������e��'3�T�8��,Z-�{�j��{���D�5O��Q��T�8��-��D�5O��Q��T�8��D�5O��Q��T�8��D�5O��Q��)5n��~�tUl����f8��?3�Ym?/%5FRE&&Y	��]��r%�u���4�S����a�{��Q�-���N���[wt.�]
?��r�����T�a�U���jN+�P*4��JNC���4^��u�j7��8��c`�R����[�W+�	9b�7K��!�E�HqW�������uZ��$��C&�����i���p�M�i"�ZW��Kq�Ix����&/��L��H�`��Zbf��(i(P��s`�����Yl�\\j�X�gf��/��*�fJ��
d4�14-LzC�i5[���MW�@���Lq��F,����,�B��)	)�M���UX�"7���]I�����1&6��!0K�=j~�%	�R5	Xh�����:*�idT��Mz=��R�|*���JW�$yX*���%��+�e�����D-��J��Rv,��g����W-�~�y�<���1���K,%�%���'*SR��VJg�Y��lHz]���r-��o���r�k���gK�����h���^��B3Hc�����Z���Z��?�(o�\���ua�e�G=�/45��9��+9yL��O*��+9yL��O'm�j�1�f�I���������TKvmwA�������|_@]�J\2����Ot�3�1��m=��������&���
����bv=?+f&e":v�E�j�SJ6����U0���r��W9u���T��Gr9�G�bf�f���B��&[�Q��#
��6<����H��&"�$)wCt=%�����>��p��#���;,���
���,!��,�g�Vl���;i����3-ykN�un72��8Y���M'������ua�e�G=�/45��9��+9yL��O*��+9yL��O'm�j�1�f�I���������TKvmwA�������|_@]�J\2����Ot�3�1��m=��������&���
����	�~W���K�O�����R\���	r��i�O�9��������Fn$��W���+�\����D<�4�Fh������e����
P\r�~b)��dW�y����~�SU����
��2�����V�s/:54f�F�T��
.�K	mp�!�,3�,
�s�i������t	��������a;����c������������v��dF���!������m7,����������=)����n�G�������0��&���*\r�@�\+v����s����~����r�e��,�J�0�{%^U������U^U������JR����Yv�o�i���Gg����l���Gg����{�W�l����R��(L��t�b�o{��W4��w��mT������cbD���k����j_��������QF���^����3O�u=�Z���M�e�V������^�k�AyV�%HK,��5���(�?g��I�e�(��r3l�#3*��#[���^�N+���a��^;�~��{Y(Ie�y$�#
]P�����^���,�F0������@-�K�����h���^��C8�^�4��T�G�����?"�C�����Z��p�Q��5��)@V)���@����|HMM����j_�����%�$g������t��<)����iPa�����UI�j/y C�V����G�=?�H��_���C��f-�/���n`�e�#����2Af%����m$u�t��R�M��ZiB�>���6i��cF�\��S:��2���������������mz��B�����mz����5N���m���L���U�uL*%�6���v]S
������/�.�%.r�tf?�
��s�����6���zve~����g��DgNd�(�G3���L�1Gr9�G�bf�f����&Y�X��^����<�~���Dx�a��]�i
����Qg�9}�e�-i�N���a�k�Zv��q��������i>.^gX���3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�hXf,8fX�l8Lk���l�Ds���{�����%��"��F������V��������$k�*�S���?r\����|�kL�Iu�>��td</�����~�S��J$�:b�[i1x��������k9�++LU]X*�������6����9,w�J�aNxC��g`1o6Z������nf����m:��x��,�Q�
&�����q�\~�>^T�]S�9�4�yS�uL't7s�6����U@p@4��������s��;�2#�����z�;>�?Cu�=��=?�8�������P�����<t�eK�\����n����:x~��9���^����E�i^&�d�k��^���S��k��^���S�J_~��.�M��!�\���������\��������xb��M�_B��J\2������,Z-�{�j��{���-����|������ti����Xs�p��4�Q��m
�q
["�w���)5l�������?,������,lc�������J��O4|Y�	�C�!�e/tV��y���PMz��!��x����y��8?�(�D~�g�e�����><h�V���������K��)=��������=���A�W�<���*�����$��CW����j�%Z�����?
,x�����3���;a9�IeO�m�~�[�����p��?�9���{=XM��d���eZ��rT��\#��g<m�7%�q���0��_�|�
��S���8���9x`	t@r���)�m�������������3k�<�����3k�<����p�$m�m&�f�f�t���aQ-�������TLf��m�}wa)p���1��m=��������N��+�|PD_x�?(* :s&�Gr9�G�bf��;���?�C5{6��2�"�����5|��������#���:��H8n�L���=����-ykN�un73�^Z���[���h�n(�I�r��8���)�Xr�f��y��
{~~Ak��^S6�S��k��^S6�S��n�rF���o�o&mwA������]�j�.��D�k�f��v��@:3���9����Ot��;2�w�E�������3�2kw#��P�&m�������13W�ahpK�,�,l��lCW��l?p�
_"<y������������(�����Y������V�s0����;i�������f��PQ4�/3�,J�s�i������t	��������a;����c������������v��dF���!������m7,����������=)����n�G�������0��&���*\r�@�\+v����s����~����r�e��,�J�0�{%^U������U^U������JR����Yv�o�i���Gg����l���Gg����{�W�l����R��(L��t�b�o{��W4��w��mTL��Oe�����sO�u=�Z���������P@^�hn��j�����I�dW���t�89�g�����Qc`}�e/tV��y���PMz��){����S�r�k���Fif��Uk��a��QD
#��������e���#�����i=y�f��\�����&Fk���[��������6I=�{���8��r_�|��%�q���0��,>~(���S���D)�Xr�f��y��
{~~Ak��^S6�S��k��^S6�S��n�rF���o�o&mwA������]�j�.��D�k�f��v��@:3���9����Ot��;2�w�E�������3�2kw#��P�&m�������13W�ahpK�,�,l��lCW��l?p�
_"<y������������(�����Y������V�s0����;i�������f��PQ4�/3�,J�r���)�m�������������3k�<�����3k�<����p�$m�m&�f�f�t���aQ-�������TLf��m�}wa)p���1��m=��������N��+�|PD_x�?(* :s&�Gr9�G�bf��;���?�C5{6��2�"�����5|��������#���:��H8n�L���=����-ykN�un73�^Z���[���h�n(�I�r��8���?f�/*{.��@���|������?���9�z�a��*�� ~��o�fDk�������[v�r��\�O��!�������aF��{�Jz�q�S
_�l�:A���.QD���`���?��<?g��iw/fQ�x������P��_/i{^��P��_/i{^���/�
���o&�f���Z�v}[J�f��Z�v}[J�G�1Ux��/�n�%.r�t���L�-���5sJ��{��@}���B��4�!��[�tiRH���
\�.������z���a��4�����v����J���N����y1�:�U
�l����P�s/��H������dD��Wj_/$�<���%���HFd�(�&�f��$�;Cq����{_G]��
@�|{�p��*{0�t88z�0�g�������X�Z�����_o,�5�����p~�S�+i+�f�j1�iHJ�j9�+�{��T��`�A��{P��Wh�FV��(��3������/�CU����^+�~��i|��J�
ca�3F��!�H�e���l��*��m�F�El���M-W
�CI��1M�SU��z
Eb;�7���QQQ~�%��f� �{.��?�|������\4
7r���Cw�CV���?p�M["�w��������8<n�D�����e�){����S�r�k���gK�����h���^��B3H{6^8xb�^}��� Q��=�G���(5|��������H���7,���p��?�23\�����'=]�^	�I��h�����v���8����fy/����i��q`)��E�_����0� 9L����6�{�^hk��s�^Vr�����U^Vr�����N�p�8c�6���|3y3k�
We�0������U�uL*&3_k6�������e�����6���gFc��{�
�����("/�M����9�X������13l���}�!����C�\�f�cf�{b�Dx�a��Pj�����v}�7f&^yE����R�����m:���e�-i�N���^4g7r�����xa�@bW���9L�h�������? ��g/)�^��P��g/)�^����
S�9#l�i7�7�6���v]S
�n���5]�T��c5��o���	K�\���i���tf?�
��p���_���"�����AQ��5�;���?�C6�Q��g��B�����8%��i6o��!��G�6�e��<�~��g�A�vbe��Y��_E,�k�Zv��q��fZ������ne�Fp�qG((�O����%p9�4�yS�uL:�����Oe�0���L�1���c
�QU�����;�2#]����l��R����M����|�Q
�����P�
7\#�zS��C�Z�j�d��
�.9r� f.�x~��9����?x��K�{2�{�u�x�~����*�{K��O*��*�{K��O%)}�oW,�y7�4�er�����VK6er�����VJ=����6q}w�)p���Nf:d�h������U���@���205#c�<��#����G��yx�bbK5��G�������������K�^q`�v�5
��f�*$w�c	����%UE����H�]W����x�~�W�U��X�����f������u=�e���S��r�5/�&�8�k��Yx�69hk���6��]w_�|�7�����'��aNM2a;�uu���P����v�6�������wr�������R����f�5���C
�U��<��)VGb�w�|%r��U����vXa�e��>���C����vtM--�J�_K_���� ?�ega
�7k�$�!'}_�!����)��������_�DIk���X�Q��i�~R��jT��<kQ�t�Lf����,�-�����U>��UE��yy��d��Ma�G-ew�-���Uo���D]i���,�e2,�rb�S��[��P�h��b?E>�m4L��*�/������K��/y4[���E>�}�!ah�Z�F�1]�U����4�j��K5�bjk;��DD��%���&��LuU�����bR/
�c�|8q�t������������Y���TTEt���������b+�
����U{]|�<t:T��%K�AHRp�4�����_}~���;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�����;}�9��h�um��H�#��C�zv��P�������s��5������w���}#{�ti����Xs�p��4�Q��m
�q
;%��B���-��q��Dm����M["�w��������8<n�D�����e�+_����0�X����Z7�'��>#�S�r��7f���T���R�A�mz��>,������������h�����G�����x|�n��~����`j���=���A�Z6�c��&nY@5IX��[{}����?������rW�k���vx&�'���O|<Pg���&QE�����[w������})�o��6��(�/����i��q`)��E�_����0Go��6������;IK������;F��soN�D�����o��6��$@�����;Nd:��2���������������mNt[�D�,�-���`��g/)�^����
S�9#l�i7�7�4�������(��]�ZV�&w��n���5]�T��c5��o���	K�\�	���E�{�[p����#��|�=���i��7�fW������6~�PT@t�H�����a�o��~�tO����>=F8m�������13W�ahpK�,�,l��lCO�H��g{Kn��t��o�<�`j�����v}�7f&^yE����R�����_ �0�X��Y�Z7�7w��}�9��i���-i�N���^4g7r�����xa�Q�����;F��soN�D%q�����o��6��$@�����;F��soN�D������C�S,�9�9y��o���,�h�����A���M��������V^Vr�����N�p�8c�6���|3x3I���n�r������o�g|������U�uL*&3_k6�������e����,�]��e�i��":�W�s�����6���zve~����g��DgNd�(�1��f�����D�~�����c���;���?�C5{6��2�"�����4���1�w���}�K�����f��<�~��g�A�vbe��Y��_E,Yz�����u�������w|���C�zv��Z������ne�Fp�qG((�O���������o��6��$AbW�����;F��soN�D�����o��6��$@�����;L/2�������(����ZV�6���9�4�yS�uL't7s�6����U@p@4|��1�7���m�G�����eF���!������m7,��������f�1��R�����D�~�����13u�=��=?�8�����6O �R��(�b�Y�dY�[�{Kn����Dm����������4���(��QgZW����(Y�������u�������|�����/k�<�����\�������|G���F��D]��������WK6er�����VJ=����6q}w�)p���7�C�zv��t�b�o{��W4��w�������;F��soN�D�A�����o��6��$@�����;F��soN�D�������D���
���7Mt�����|������ti����Xs�p��4�Q��m
�q
["�w���)5l�������?,������,lc�������J��O4|Y�	�C�!�e/tV��y���PMz��!��x����y��8?�(�D~\����������{�2�b�m"��<L���k���[�����rW�k���vx&�'���O|<Pg���nK��c��a����6>�������qc�r�����3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�Mb��s>�����w#��P�&j�l-	re�E�����j�����A��G�6�u���p���y�z3��K6Z������nf����m:��x��,�Q�
&�����q�\S:��2���������������mz��B�����mz����5N���m���L���U�uL*%�6���v]S
������/�.�%.r�tf?�
��s�����6���zve~����g��DgNd�(�G3���L�1Gr9�G�bf�f����&Y�X��^����<�~���Dx�a��]�i
����Qg�9}�e�-i�N���a�k�Zv��q��������i>.^gX�������Oe�0���O��=�T�wG�3p�8#o]�7�ETO����l��w?�C���"1Kn�nY7������D7\#�zS��C�(�p�q�OO�9ja��M��H6T���(���V���?x���0������.���=�Y���a��J����/k�<�����/k�<�����\�������Z���iY,���Z���iY(��*����-����.P�9����������iW��y��*���:���a��@�� �{.��?
�M����v�����+�{�2�V���?p��ps���/<���<�n�^�4��T�G�����?"�R�A�mz��>,������������h�����G�����l�
_=�G���(6+F�,z����(�+�5���L��%{����I�WaW�l�z-�4���p(-�����6>���K��c��a��X
|�Qg�>�/.�S:��2���������������mz��B�����mz����5N���m���L���U�uL*%�6���v]S
������/�.�%.r�tf?�
��s�����6���zve~����g��DgNd�(�G3���L�1Gr9�G�bf�f����&Y�X��^����<�~���Dx�a��]�i
����Qg�9}�e�-i�N���a�k�Zv��q��������i>.^gX���3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�Mb��s>�����w#��P�&j�l-	re�E�����j�����A��G�6�u���p���y�z3��K6Z������nf����m:��x��,�Q�
&�����q�\~�>^T�]S�9�4�yS�uL't7s�6����U@p@4��������s��;�2#�����z�;>�?Cu�=��=?�8�������P�����<t�eK�\����n����:x~��9���^����E�i^&�d�k��^���S��k��^���S�J_~��.�M��!�\���������\��������xb��M�_B��J\2������,Z-�{�j��{���-����|������ti����Xs�p��4�Q��m
�q
["�w���)5l�������?,������,lc�������J��O4|Y�	�C�!�e/tV��y���PMz��!��x����y��8?�(�D~\����������{�2�b�m"��<L���k���[�����rW�k���vx&�'���O|<Pg���nK��c��a����6>�������qc�r�����3�S,�9�9y��o���-yY��f��yT-yY��f��y;m�T��H�6�M������5]�T��[�k�
We�0���}������R��(Fc��{�9���i��7�fW������6~�PT@t�Mb��s>�����w#��P�&j�l-	re�E�����j�����A��G�6�u���p���y�z3��K6Z������nf����m:��x��,�Q�
&�����q�\S:��2���������������mz��B�����mz����5N���m���L���U�uL*%�6���v]S
������/�.�%.r�tf?�
��s�����6���zve~����g��DgNd�(�G3���L�1Gr9�G�bf�f����&Y�X��^����<�~���Dx�a��]�i
����Qg�9}�e�-i�N���a�k�Zv��q��������i>.^gX�������Oe�0���O��=�T�wG�3p�8#o]�7�ETO����l��w?�C���"1Kn�nY7������D7\#�zS��C�(�p�q�OO�9ja��M��H6T���(���V���?x���0������.���=�Y���a��J����/k�<�����/k�<�����\�������Z���iY,���Z���iY(��*����-����.P�9����������iW��y����j�T�]�Q�d���;��DN5_�����������{S�V�
rI����5jrGT�B��U��a�F�QW�"'�V�5j��2��&��8iJ���k[��UK}����a����|5?t��i�-8��~	d��V��B��������h������?�?M����)��/��_��A]�X�W�E�/	�"�r6,'��o����]�O��/�3�U�X�dz�F�����^v[��^}P���:����U�$��?����/����$���NgptA�����I���/���R���4z�Q�pg���,f�_T7'[�������8�H�9�����b/��Og��I+������R�:����-`�Sq��M�ZR�
���k^��I|��DK��:��.[|T&�����8�F���3F0��)�Z:�~��^OX+?�xs�z����c�8�#R��_,���Zs�>���_�P�n����������
#"�dHOk���sV���J�~��G[�0��f� �{.��?�|������\4
7r���Cw�CV���?p�M["�w��������8<n�D�����e�){����S�r�k���gK�����h���^��B3H{6^8xb�^}��� Q��=�G���(5|��������H���7,���p��?�23\�����'=]�^	�I��h�����v���8����fy/����i��q`)��E�_����0� 9L����6�{�^hk��s�^Vr�����U^Vr�����N�p�8c�6���|3y3k�
We�0������U�uL*&3_k6�������e�����6���gFc��{�
�����("/�M����9�X������13l���}�!����C�\�f�cf�{b�Dx�a��Pj�����v}�7f&^yE����R�����m:���e�-i�N���^4g7r�����xa�@bW���9L�h�������? ��g/)�^��P��g/)�^����
S�9#l�i7�7�6���v]S
�n���5]�T��c5��o���	K�\���i���tf?�
��p���_���"�����AQ��5�;���?�C6�Q��g��B�����8%��i6o��!��G�6�e��<�~��g�A�vbe��Y��_E,�k�Zv��q��fZ������ne�Fp�qG((�O����%p9�4�yS�uL:�����Oe�0���L�1���c
�QU�����;�2#]����l��R����M����|�Q
�����P�
7\#�zS��C�Z�j�d��
�.9r� f.�x~��9����?x��K�{2�{�u�x�~����*�{K��O*��*�{K��O%)}�oW,�y7�4�er�����VK6er�����VJ=����6q}w�)p���Nf:d�h������U���@���OTJ�U��{d%"�+W��b���a�G*���^K���F�U :zM�q��u�
�����D��o-�5[m�4����r�	Q"OH��j�+�����C�Y����N�@�T`I�(r�%��3+��l5���kMI}v��|�O3�2��Y�;Nv,'B��[M���j��GK�/&w�>Pf��Pa�
�'B�Xo��
�g3t���*w��EO-��JJ^�N�������B|��V�1����kK���g}��0{d����0{d��Dw�[s�L|��[s�L|�4@g}��0{d����0{d��Dw�[s�L|��[s�L|�#����Dm�EV2��(6����=�c����=�c��;���e�����������]���jt�k�~l\�&���"��Ul����H��D�al��1�al��1�{�������t~z��#�>.����D����F���G��a ��``����;``����U�������������������?��al��1�al��1�����JrzN�C_�^�z�p�!�^�;EJ������T5|������������0v����0�������[�k�����G_�?5o_I$d�u�W_�����q�IUQ'��M*f��[s�L|��[s�L|��5��4c�O�tzI����}_���C{^�V��tT�)�gB���[*�UO�����l������l��������'�����������Z�������|��������������0v����0��-�_���/�_����`��9�.��UU�D�#�!������0v����0�Y�zR����C�AV�U��f��WJ1���Dw�[s�L|��[s�L|����������0v����0
�al��1�al��1���>A��]k~6����=�c����=�c�}��*q����:��|>Q�8-V�O	�Z�_.j�����]��0{d����0{d��=m�����G��^�o�_�������*t���~l$�al��1�al��1�
w������C�Y����7�^�4��T�G�����?"����=�c����=�c������4,�=Z��k���j���i�}=�������_�^��8&;``����;``����A}��?���������Fg�����e��[s�L|��[s�L|��R���3kS�7_���=Z�_�5�^���R{�``����;``����y�b�S�N���a��u]vO��2�5��j�uuF��P5^���=�c����=�c�e���������_��<�/����i�w�[s�L|��[s�L|��a����5�������\����h���Z�z���4@g}��0{d����0{d��:�m�al��1�al��1��������0v����0
�3h�-�9��&>`�-�9��&>a'w^?A�~��z�������q���I����������3k�<������0v����0��zZsS�z��0���r��:�~�T~�7���5]�T��n���0{d����0{d��)s��9�7K����>��
2R�Z�0�>���$���Ot��-�9��&>`�-�9��&>a�j�~�e�f���j����}Z�v���u)��z��������al��1�al��1�
�����o��_\���O�(�G3���L��``����;``����Z�X}))t����W^�u�������[��_�V�lc���SW��l?p��-�9��&>`�-�9��&>a�N���kxY��AV:z�|�6T
W���l������l���s�����h�j����������m:��������0v����0��,@�].��������?�x}aVz=P�����!�h���``����;``����H�Z 3����=�c����=�c��;�-�9��&>`�-�9��&>`!�f��[s�L|��[s�L|�N��~����z�{uuk�G����>����S-yY��f��y}�-�9��&>`�-�9��&>a�^�������k�a�����vt'��>��?6o�]�j�.��D��``����;``����R���sFn�_��h6}'�d����a}��5|I����6���[s�L|��[s�L|���t�"�R���������:��������S���t5�u���	����=�c����=�c���_����<��������Q��g��B��v����0v����0����:RR�}�a��P�U��kF�KV�f�~����b���<�~�'�[s�L|��[s�L|�����4&���]~��*t���~l��al��1�al��1�
��o��o��~���%C-ykN�un73;�-�9��&>`�-�9��&>a9vX>�J4�]-q�����~+�����*�z:���_�1�C��������0v����0�F�@g}��0{d����0{d��Dw�[s�L|��[s�L|�4C��O��=�T���[s�L|��[s�L|�����
�jtu����_
�5����!&�^�.
����l������l�����?��_��O���������l������l������l���*�����Z�����H~�g�����_�^���?n�G����������=�c����=�c�S]��������^��~�������$��=]��}Z�]HpLv����0v����0�}��?�������?�������6����=�c����=�c�k��[,�n����^��u�kWojX��M}����b��*�{K��O/���0{d����0{d��:e��f�u�i]�yg��F������Gg�����al��1�al��1�;F��j�U���Q�z�1����K�iIK�k����j�2��L����l������l�����u�_����KW�WV��~(���������n��_(| 3����=�c����=�c��������l������l���h���``����;``���������l����
a�>�D�����V$U��n��w�"*�������L�38j�5UmNf�L�Rb��n$�7FEN+=R��K���a:Z�Q&�l6����bE{��cZ�j����2|������q/�$�M�Ot���=���lFDwn�K����TU����a��6���S����Z��2v�5�X�/�	�T�:�D������
l�����S0��	�I\�I9���U���F������<�h�1%-ju�,YE�~�<���%E5De�4�����S��@3:��fW�5F�'��R��J�t�)���k����Cm�"j_�
0&�:����W
bL79������N�e��kQR���"jD���!kQssa��{�c��
�5Au5����������/��N5un�C�a�U^7C�P��6�;����E��I��3W��J.�K-
�Sc�cN��M�R�������u%�]>��i�p��cQqc$1,��%��\��/
R,�T�k��{f��ok��g�`�a?K�#bY�h�,�Q�)/
/a�����Qu+��e@50A�Y�����	�����:^<h�(�'�\��Y���5�3�����$l;���9e�8�e�%�L��K���"�}Q���K���J�4��K����rO�+�m���$�)Re��i���sQR�}�V��*�e�1�r��*��z�B{e�jQ$�5beR�c���%^;�o�5�S��&1l�FG���1)FV�,�k'�*}e��v9R��&���pmt��+�����u7�y�����b�G^��4����:e������ct��E[�W�, �>���h�d�m���h�b�%!�cP��"�Cu�
�psQ���V��������������eB=2Z4���I��|=�����u�j=���[����F��8��p4i�H�J�l�������_I��U]&�5�"��q��c*ZM~U&%�{l�k��8��&�ry~�N%R��r��@������+�)V�%b�'Vg{"��sK"&�]�40�Vu2$��)
�RR-2op�ZlW����EMs����n�E��S>��QD�U���3��Z;��qe�0���R���{_�X������7�Q�%�R��hqaTQ���dV9�H�K"��6W.�K~#�R���dwSi)��P��M����ek�����������be.��*F������s0#"��v�*��9���e�k*�����%JV��`f$��$8p�V��cQ,�j$;"'��e%��J�����P ����jY�K"'�����Q��2�'#�����A�Q7H�tQI�����U���`9�@Z��%C������Tp��D��V�Rg�tzkD^�V��S�o�c`z.7���Y���bn��r�Vyw���N.$��5'���)2�K�8vKo�������(�������dF#�U����S�S��D�0.8���+&'�J�d��ZB�RN
5R&���a55����U��y�gE�Ui��9��Z����Zif#�?�����e�YlI�jTY���+?��]Q�rYU����
�G�xo-���uZv�����������g�B��P���W/���L�E�S�r���yxr��Q\�cQ��TDK�5��3���P�����J)��X���.m�DTM8Z+���D�"*��T4b��0}'��N�������F�a���Oc��R����g9[��X��I�P�_�IJu�A��a���*��dDD�_)������g�R�bnbRR�<:[&_���T��b��U�j6�d��Mce}��i����e��)��������k����{*�}O�2��&&�p�g*�Z���(��y����������{Y/�����(��`�3�
<��bG��:��s��w����z��6��pm�O�A��D�������<E�{��k�E�2J�d��9�]����2XjV�b8�����e�V�M��~�/��b��1��y���vJYc���t$DU����MH�i������B�L^���)�SU)�[����v�V�����u����R�����_N���4�h9!���w���V��Cz��0�
V�3U�KK�����
��Q;M;��+Q�)}%TO�������e]��?_���DUH:������jiY5�����*8�2p���/�k��S1I�t_%���b*�}��c5���et�$���di�xL���Dj���j��Q��t��AG�����6��q$�j��nL���c������U^$E[�����>�9�SU�O��rZ�$�DD]����^�"�_��<����!Q��l��1c���K5b���"~�?m��.��)�X�Lb	����Z$paL�M$�h^���m�U��UEN���oC����	'OL�����������;g�w���P��T���O�N�Hrm�����
�������KE,��}H�U%����������F�1$b���?�+�+UV#��v���4�W�Y�a�K���RE����|��iQ��
�ZH�j"-��5"�vFa�M	x�C=F��(ty���$��������U_��epd��k�2um���Vk"C��k��wj��K%��4������W5/)�V�,��j5R�������
;*��|'M�;T�
M����D��{��v�EF%��K���Z��S��1-F�9=jZ4�OK�5�D��v���Y~�3��������Q�B����YU|DE{%Z���}h�D_�E<�{Qp��YWd0������h�yY�=�3���f��5TN$D��4<[�T\E_}n�j�W���j�:���jjD~�E������L��+���I�y7��z4�##�#�k��o{�n�d����V��E>����������������w�F������W���������-�C��,���Lr9�j����?��endstream
Attachment: patches.tgz (application/x-compressed-tar)
#47Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#45)
Re: [PoC] Non-volatile WAL buffer

On 22.01.2021 5:32, Tomas Vondra wrote:

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

    branch                 1        16        32        64        96
    ----------------------------------------------------------------
    master              7291     87704    165310    150437    224186
    ntt                 7912    106095    213206    212410    237819
    simple-no-buffers   7654     96544    115416     95828    103065

NTT refers to the patch from September 10, which pre-allocates a large
WAL file on PMEM, and simple-no-buffers is the simpler patch that just
removes the WAL buffers and writes directly to a mmap-ed WAL segment on
PMEM.
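
To make the comparison easier to read, the figures from the table above
can be normalized against master. The following is only a small sketch:
the numbers are copied from the table, while the names CLIENTS, RESULTS
and relative_to_master are made up for this illustration.

```python
# Throughput figures copied from the table above (16MB WAL segments).
CLIENTS = [1, 16, 32, 64, 96]
RESULTS = {
    "master":            [7291, 87704, 165310, 150437, 224186],
    "ntt":               [7912, 106095, 213206, 212410, 237819],
    "simple-no-buffers": [7654, 96544, 115416, 95828, 103065],
}


def relative_to_master(results):
    """Return {branch: [ratio per client count]}, with master scaled to 1.0."""
    base = results["master"]
    return {
        branch: [value / ref for value, ref in zip(values, base)]
        for branch, values in results.items()
    }


if __name__ == "__main__":
    header = "  ".join(f"{c:>5}" for c in CLIENTS)
    print(f"{'branch':<18} {header}")
    for branch, row in relative_to_master(RESULTS).items():
        cells = "  ".join(f"{r:5.2f}" for r in row)
        print(f"{branch:<18} {cells}")
```

Read this way, simple-no-buffers is slightly ahead of master at 1 and 16
clients and falls well behind from 32 clients on, which is the pattern
discussed below.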

Note: The patch just replaces the old implementation with mmap. That's
good enough for experiments, testing and benchmarking, but we probably
want to keep the old implementation for setups without PMEM.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what the root cause is, but I have a couple of hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to rule it out.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run look like this:

    xlog stats cnt 43000000
       map cnt 100 time 5448333 unmap cnt 100 time 3730963
       memcpy cnt 985964 time 1550442272 len 15150499
       memset cnt 0 time 0 len 0
       persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
100 unmap calls, taking ~10ms in total. There were ~14k pmem_persist
calls, taking ~10ms in total. And most of the time (~1.5s) was used by
pmem_memcpy, copying about 15MB of data. That's quite a lot :-(
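
The per-call averages behind those totals can be worked out directly
from the sample. This is just arithmetic on the counters shown above,
assuming "time" is nanoseconds and "len" is bytes as described; the
dictionary layout and helper names are invented for the illustration.

```python
# Counters copied from the "xlog stats" sample above; times in nanoseconds.
STATS = {
    "map":     {"cnt": 100,    "time_ns": 5448333},
    "unmap":   {"cnt": 100,    "time_ns": 3730963},
    "memcpy":  {"cnt": 985964, "time_ns": 1550442272, "len": 15150499},
    "persist": {"cnt": 13836,  "time_ns": 10369617,   "len": 16292182},
}


def avg_ns_per_call(op):
    """Average time of one call of the given operation, in nanoseconds."""
    return STATS[op]["time_ns"] / STATS[op]["cnt"]


def avg_bytes_per_call(op):
    """Average number of bytes handled by one call of the given operation."""
    return STATS[op]["len"] / STATS[op]["cnt"]


def total_ms(*ops):
    """Total time spent in the given operations, in milliseconds."""
    return sum(STATS[op]["time_ns"] for op in ops) / 1e6


if __name__ == "__main__":
    print(f"map+unmap total: {total_ms('map', 'unmap'):.2f} ms")
    print(f"persist total:   {total_ms('persist'):.2f} ms")
    print(f"memcpy total:    {total_ms('memcpy'):.2f} ms")
    print(f"memcpy avg:      {avg_ns_per_call('memcpy'):.0f} ns, "
          f"{avg_bytes_per_call('memcpy'):.1f} bytes per call")
```

These are averages only and say nothing about the distribution, but they
suggest many small copies (roughly 15 bytes each) rather than a few
large ones.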

It might also be interesting to see how much time is spent in each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().

Yeah, we could extend it to that, that's a fairly mechanical thing. But
maybe that would be visible in a regular perf profile. Also, I suppose
most of the time will be used by the pmem calls, shown in the stats.

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).
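
The "keep the buffers, then copy to a mapped segment" idea can be
sketched roughly as follows. This is not the patch: it uses Python's
mmap on an ordinary file, with mmap.flush() (msync) standing in for
pmem_persist(), and the class name, the tiny segment size and the record
format are all assumptions made up for the illustration.

```python
import mmap
import os
import tempfile

SEGMENT_SIZE = 64 * 1024  # tiny stand-in for a real WAL segment


class BufferedMmapLog:
    """Stage records in an ordinary in-memory buffer, then copy them into
    an mmap-ed segment file and flush the mapping on demand."""

    def __init__(self, path, size=SEGMENT_SIZE):
        self._fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
        os.ftruncate(self._fd, size)      # pre-size the segment file
        self._map = mmap.mmap(self._fd, size)
        self._buffer = bytearray()        # plays the role of the WAL buffers
        self._written = 0                 # bytes already copied to the mapping

    def insert(self, record):
        """Append a record to the in-memory buffer only."""
        self._buffer += record

    def flush(self):
        """Copy buffered bytes into the mapping and make them durable."""
        end = self._written + len(self._buffer)
        if end > len(self._map):
            raise ValueError("segment full; a real design would switch segments")
        self._map[self._written:end] = bytes(self._buffer)
        self._map.flush()                 # msync here; not a PMEM cache flush
        self._written = end
        self._buffer.clear()
        return self._written

    def close(self):
        self._map.close()
        os.close(self._fd)


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "segment")
    log = BufferedMmapLog(demo_path)
    log.insert(b"record-1;")
    log.insert(b"record-2;")
    print("durable up to byte", log.flush())
    log.close()
```

The point of the sketch is only the division of labour: inserts touch
DRAM, and the mapped segment is written once per flush.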

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

    branch                 1        16        32        64        96
    ----------------------------------------------------------------
    master              7291     87704    165310    150437    224186
    ntt                 7912    106095    213206    212410    237819
    simple-no-buffers   7654     96544    115416     95828    103065
    with-wal-buffers    7477     95454    181702    140167    214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

While looking at the two methods, NTT and simple-no-buffer, I realized
that in XLogFlush() the NTT patch flushes WAL (by pmem_flush() and
pmem_drain()) without acquiring WALWriteLock, whereas the
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also contributed to the performance
difference between the two methods, since WALWriteLock serializes the
operations. With PMEM, can multiple backends flush records concurrently
if their memory regions do not overlap? If so, flushing WAL without
WALWriteLock would be a big benefit.
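
As a rough illustration of flushing non-overlapping regions without a
shared lock, here is a sketch with several workers that each write and
flush their own page of one mapping. It is only an API-level analogy:
Python threads stand in for backends, mmap.flush() (msync on a regular
file) stands in for pmem_flush()/pmem_drain(), and the function names
are made up. It makes no claim about PMEM performance or about what is
safe inside XLogFlush().

```python
import mmap
import os
import tempfile
import threading

# mmap.flush() offsets must be aligned to this granularity.
PAGE = mmap.ALLOCATIONGRANULARITY


def write_and_flush(mapping, slot, payload):
    """Write payload into this worker's own page and flush only that page."""
    offset = slot * PAGE
    mapping[offset:offset + len(payload)] = payload
    mapping.flush(offset, PAGE)   # no lock shared with the other workers


def run_workers(path, n_workers):
    """Map one file and let each worker persist a disjoint region of it."""
    size = n_workers * PAGE
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.ftruncate(fd, size)
        with mmap.mmap(fd, size) as mapping:
            threads = [
                threading.Thread(
                    target=write_and_flush,
                    args=(mapping, i, f"record-{i}".encode()),
                )
                for i in range(n_workers)
            ]
            for t in threads:
                t.start()
            for t in threads:
                t.join()
    finally:
        os.close(fd)
    return size


if __name__ == "__main__":
    demo_path = os.path.join(tempfile.mkdtemp(), "shared-segment")
    print("bytes mapped:", run_workers(demo_path, 4))
```

Because each worker touches a separate page-aligned range, nothing here
needs to be serialized; whether the same holds for WAL on real PMEM is
exactly the open question above.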

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the
WAL directly to PMEM. So it's a bit confusing, because at that point the
code is really only concerned with making sure the WAL is flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard
from people with more PMEM experience.

TBH I'm not convinced the code in the "simple-no-buffer" patch (coming
from the 0002 patch) is actually correct. Essentially, consider a
backend that needs to do a flush, but does not have the segment mapped.
So it maps it and calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look
like this:

    branch                 1        16        32        64        96
    ----------------------------------------------------------------
    master              6635     88524    171106    163387    245307
    ntt                 7909    106826    217364    223338    242042
    simple-no-buffers   7871    101575    199403    188074    224716
    with-wal-buffers    7643    101056    206911    223860    261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers patch is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments, to make this useful for common instances (not
everyone wants to run with 1GB WAL segments)? Or do we need to adopt the
design with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and the NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I think the performance improvement by the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we haven't done that yet.

Well, not sure. I think the question is still open whether it's
actually safe to run on DAX, which does not have atomic writes of 512B
sectors, and I think we rely on that e.g. for pg_control. But maybe for
WAL that's not an issue.

Also, I'm interested in why the throughput of the NTT patch saturated at
32 clients, which is earlier than master (96 clients). How
many CPU cores are there on the machine you used?

From what I know, this is somewhat expected for PMEM devices, for a
bunch of reasons:

1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%),
so it takes fewer processes to saturate it.

2) Internally, the PMEM has a 256B buffer for writes, used for
combining etc. With too many processes sending writes, the write stream
starts to look more random, which is harmful for throughput.

When combined, this means the performance starts dropping at a certain
number of threads, and the optimal number of threads is rather low
(something like 5-10). This is very different behavior compared to DRAM.

There's a nice overview and measurements in this paper:

Building blocks for persistent memory / How to get the most out of
your new memory?
Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
Kemper

https://link.springer.com/article/10.1007/s00778-020-00622-9

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

      The read-write asymmetry of PMem implies the necessity of avoiding
      writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, and not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit all of its abilities, we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers
with writes directly to PMEM is not economical, and aggregating data in
a DRAM buffer is better :-(

regards

I have heard from several DBMS experts that the appearance of huge and
cheap non-volatile memory could revolutionize database system
architecture. If the whole database fits in non-volatile memory, then we
do not need buffers, WAL, ...
But although multi-terabyte NVM announcements were made by IBM several
years ago, I am not aware of any successful DBMS prototypes with a new
architecture.
I tried to understand why...

It was very interesting to me to read this thread, which actually
started in 2016 with the "Non-volatile Memory Logging" presentation at
PGCon. As far as I understand from Tomas's results, right now using PMEM
for WAL doesn't provide any substantial increase in performance.

But the main advantage of PMEM from my point of view is that it allows
us to avoid write-ahead logging altogether!
Certainly we need to change our algorithms to make that possible.
Speaking about Postgres, we would have to rewrite all indexes + heap
and throw away the buffer manager + WAL.

What can be used instead of the standard B-Tree?
For example, there is a description of the multi-word CAS approach:

   http://justinlevandoski.org/papers/mwcas.pdf

and a BzTree implementation on top of it:

   https://www.cc.gatech.edu/~jarulraj/papers/2018.bztree.vldb.pdf

There is a free BzTree implementation on GitHub:

    git@github.com:sfu-dis/bztree.git

I tried to adapt it for Postgres. It was not so easy because:
1. It is written in modern C++ (-std=c++14)
2. It supports multithreading, but not multiprocess access

So I had to patch the code of this library instead of just using it:

  git@github.com:postgrespro/bztree.git

I have not yet tested the most interesting case: access to PMEM through
PMDK. And I do not have hardware for such tests.
But the first results also seem interesting: PMwCAS is a kind of
lockless algorithm and it shows much better scaling on a
NUMA host compared with standard Postgres.

I have done a simple parallel insertion test: multiple clients
insert data with random keys.
To make the comparison with vanilla Postgres fairer, I used an unlogged table:

create unlogged table t(pk int, payload int);
create index on t using bztree(pk);

randinsert.sql:
insert into t (payload,pk) values
(generate_series(1,1000),random()*1000000000);

pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres

So each client inserts one million records.
The target system has 160 virtual and 80 real cores with 256GB of RAM.
Results (TPS) are as follows:

N      nbtree    bztree
1         540       455
10        993      2237
100      1479      5025

So bztree is more than 3 times faster for 100 clients.
Just for comparison: the result for inserting into this table without an
index is 10k TPS.
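
To put the table above into rows per second and relative terms, here is a small sketch in Python. It is not part of the benchmark itself; the TPS values are copied from the table, ROWS_PER_TX follows from generate_series(1,1000) in randinsert.sql, and the helper and variable names are only illustrative.

```python
ROWS_PER_TX = 1000  # each transaction inserts generate_series(1,1000)

# TPS figures copied from the results table above, keyed by client count N.
results = {
    1:   {"nbtree": 540,  "bztree": 455},
    10:  {"nbtree": 993,  "bztree": 2237},
    100: {"nbtree": 1479, "bztree": 5025},
}

def rows_per_sec(tps):
    """Convert transactions per second into inserted rows per second."""
    return tps * ROWS_PER_TX

# bztree throughput relative to nbtree at each client count.
speedup = {n: r["bztree"] / r["nbtree"] for n, r in results.items()}

# How much each access method gains when going from 1 to 100 clients.
scaling = {am: results[100][am] / results[1][am] for am in ("nbtree", "bztree")}

for n, r in results.items():
    print(f"N={n:3d}  bztree/nbtree = {speedup[n]:.2f}  "
          f"bztree rows/s = {rows_per_sec(r['bztree'])}")
print({am: round(s, 1) for am, s in scaling.items()})
```

With these figures bztree is slightly slower than nbtree for a single client (about 0.84x) and about 3.4x faster at 100 clients; from 1 to 100 clients nbtree scales by roughly 2.7x and bztree by roughly 11x.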

Next I am going to experiment with PMEM.
If the results are promising, then it is possible to think about a
reimplementation of the heap and a WAL-less Postgres!

I am sorry that my post has no direct relation to the topic of this
thread (Non-volatile WAL buffer).
It seems to me that it is better to use PMEM to eliminate WAL altogether
instead of optimizing it.
Certainly, I realize that WAL plays a very important role in Postgres:
archiving and replication are based on WAL. So even if we can live
without WAL, it is still not clear whether we really want to live
without it.

One more idea: using the multi-word CAS approach requires us to express
changes as editing sequences.
Such an editing sequence is effectively a ready-made WAL record. So
implementors of access methods do not have to do
double work: update the data structure in memory and create the
corresponding WAL records. Moreover, PMwCAS operations are atomic:
we can replay or revert them in case of a fault. So there is no need for
FPW (full page writes), which have a very noticeable impact on WAL size
and database performance.
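
The idea that an editing sequence doubles as a log record can be illustrated with a toy model in Python. This is only a single-threaded sketch of the semantics: it uses no atomic instructions and no persistent memory, and it is not the API of the PMwCAS library or of the bztree code. All names (WordEdit, mwcas, redo, undo) are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WordEdit:
    """One entry of an editing sequence: word address, old value, new value."""
    addr: int
    expected: int
    new: int

def mwcas(memory, edits):
    """Apply all edits if every word still holds its expected value, else none."""
    if any(memory[e.addr] != e.expected for e in edits):
        return False
    for e in edits:
        memory[e.addr] = e.new
    return True

def redo(memory, edits):
    """Roll a (possibly half-applied) editing sequence forward."""
    for e in edits:
        memory[e.addr] = e.new

def undo(memory, edits):
    """Roll a (possibly half-applied) editing sequence back."""
    for e in edits:
        memory[e.addr] = e.expected

edits = [WordEdit(0, 10, 11), WordEdit(2, 30, 31)]

mem = [10, 20, 30]
print(mwcas(mem, edits), mem)   # all words matched, so both are updated

crashed = [11, 20, 30]          # simulate a fault after the first word only
redo(crashed, edits)
print(crashed)                  # same final state as the successful case
```

Because each entry carries both the old and the new value, the same sequence is enough to replay or to revert the operation, which is the property the paragraph above relies on.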

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#48Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#45)
Re: [PoC] Non-volatile WAL buffer

On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

    branch                 1        16        32        64        96
    ----------------------------------------------------------------
    master              7291     87704    165310    150437    224186
    ntt                 7912    106095    213206    212410    237819
    simple-no-buffers   7654     96544    115416     95828    103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what's the root cause, but I have a couple of hypotheses:

1) bug in the patch - That's clearly a possibility, although I've
tried to eliminate this possibility.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(
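
The totals above can also be broken down per call. The following Python sketch does only that arithmetic; it assumes, as the explanation above does, that "time" is a total in nanoseconds and "len" a total in bytes. The dictionary layout and the summarize() helper are illustrative and not part of the patch.

```python
# Figures copied from the xlog stats sample above (times in nanoseconds).
stats = {
    "map":     {"cnt": 100,    "time_ns": 5448333,    "len": None},
    "unmap":   {"cnt": 100,    "time_ns": 3730963,    "len": None},
    "memcpy":  {"cnt": 985964, "time_ns": 1550442272, "len": 15150499},
    "persist": {"cnt": 13836,  "time_ns": 10369617,   "len": 16292182},
}

def summarize(stats):
    """Return {operation: (average ns per call, average bytes per call or None)}."""
    out = {}
    for name, s in stats.items():
        avg_ns = s["time_ns"] / s["cnt"]
        avg_len = s["len"] / s["cnt"] if s["len"] is not None else None
        out[name] = (avg_ns, avg_len)
    return out

summary = summarize(stats)
for name, (avg_ns, avg_len) in summary.items():
    size = f", {avg_len:.1f} bytes/call" if avg_len is not None else ""
    print(f"{name:8s} {avg_ns:10.0f} ns/call{size}")
```

Read this way, the sample works out to roughly 1.57 microseconds per pmem_memcpy call at about 15 bytes per call, and about 0.75 microseconds per pmem_persist call.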

It might also be interesting if we can see how much time is spent in each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().

Yeah, we could extend it to that, that's a fairly mechanical thing. But
maybe that could be visible in a regular perf profile. Also, I suppose
most of the time will be used by the pmem calls, shown in the stats.

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

    branch                 1        16        32        64        96
    ----------------------------------------------------------------
    master              7291     87704    165310    150437    224186
    ntt                 7912    106095    213206    212410    237819
    simple-no-buffers   7654     96544    115416     95828    103065
    with-wal-buffers    7477     95454    181702    140167    214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.
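
To make the comparison above easier to read, the table can be normalized against master. The following is a small Python sketch, not part of any patch in this thread; the TPS figures are copied from the table, and the helper name relative_to is just an illustrative choice.

```python
clients = [1, 16, 32, 64, 96]

# TPS with 16MB WAL segments, copied from the table above.
tps = {
    "master":            [7291, 87704, 165310, 150437, 224186],
    "ntt":               [7912, 106095, 213206, 212410, 237819],
    "simple-no-buffers": [7654, 96544, 115416, 95828, 103065],
    "with-wal-buffers":  [7477, 95454, 181702, 140167, 214715],
}

def relative_to(baseline, results):
    """Divide every branch's TPS by the baseline branch, per client count."""
    base = results[baseline]
    return {
        branch: [v / b for v, b in zip(values, base)]
        for branch, values in results.items()
        if branch != baseline
    }

ratios = relative_to("master", tps)
for branch, values in ratios.items():
    row = "  ".join(f"{c}:{r:.2f}" for c, r in zip(clients, values))
    print(f"{branch:18s} {row}")
```

With these numbers, ntt comes out at roughly 1.41x master at 64 clients, while with-wal-buffers is below master at 64 and 96 clients (about 0.93x and 0.96x) and simple-no-buffers drops to about 0.46x at 96 clients.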

At this point I was wondering how the NTT patch could be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

While looking at the two methods: NTT and simple-no-buffer, I realized
that in XLogFlush(), NTT patch flushes (by pmem_flush() and
pmem_drain()) WAL without acquiring WALWriteLock whereas
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
differences between those two methods since WALWriteLock serializes
the operations. With PMEM, multiple backends can concurrently flush
the records if the memory region is not overlapped? If so, flushing
WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

Thanks, that's good to know.

TBH I'm not convinced the "simple-no-buffer" code (coming from the 0002
patch) is actually correct. Essentially, consider that a backend needs
to do a flush, but does not have a segment mapped. So it maps it and
calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

Yeah, in terms of experiments at least it's good to find out that the
approach of mmapping each WAL segment does not perform well.

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look like this:

    branch                 1        16        32        64        96
    ----------------------------------------------------------------
    master              6635     88524    171106    163387    245307
    ntt                 7909    106826    217364    223338    242042
    simple-no-buffers   7871    101575    199403    188074    224716
    with-wal-buffers    7643    101056    206911    223860    261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers patch is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.
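
The effect of the segment size can be quantified by dividing the 1GB results by the 16MB results for each branch. This Python sketch only restates the two tables from this message; the variable and function names are illustrative.

```python
clients = [1, 16, 32, 64, 96]

# TPS with 16MB WAL segments (earlier table in this message).
tps_16mb = {
    "master":            [7291, 87704, 165310, 150437, 224186],
    "ntt":               [7912, 106095, 213206, 212410, 237819],
    "simple-no-buffers": [7654, 96544, 115416, 95828, 103065],
    "with-wal-buffers":  [7477, 95454, 181702, 140167, 214715],
}

# TPS with 1GB WAL segments (table directly above).
tps_1gb = {
    "master":            [6635, 88524, 171106, 163387, 245307],
    "ntt":               [7909, 106826, 217364, 223338, 242042],
    "simple-no-buffers": [7871, 101575, 199403, 188074, 224716],
    "with-wal-buffers":  [7643, 101056, 206911, 223860, 261712],
}

def segment_size_gain(small, large):
    """Per-branch ratio of TPS with large segments to TPS with small segments."""
    return {
        branch: [l / s for s, l in zip(small[branch], large[branch])]
        for branch in small
    }

gain = segment_size_gain(tps_16mb, tps_1gb)
idx_64 = clients.index(64)
for branch, values in gain.items():
    print(f"{branch:18s} 64 clients: {values[idx_64]:.2f}x")
```

At 64 clients the two mmap-per-segment patches gain the most from larger segments (about 1.96x for simple-no-buffers and 1.60x for with-wal-buffers), whereas master and ntt change by less than 10%.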

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments, to make this useful for common instances (not
everyone wants to run with 1GB WAL segments)? Or do we need to adopt the
design with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and the NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I think the performance improvement by the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

Also, I'm interested in why the throughput of the NTT patch saturated at
32 clients, which is earlier than master (96 clients). How
many CPU cores are there on the machine you used?

From what I know, this is somewhat expected for PMEM devices, for a
bunch of reasons:

1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
it takes fewer processes to saturate it.

2) Internally, the PMEM has a 256B buffer for writes, used for combining
etc. With too many processes sending writes, the write stream starts to
look more random, which is harmful for throughput.

When combined, this means the performance starts dropping at a certain
number of threads, and the optimal number of threads is rather low
(something like 5-10). This is very different behavior compared to DRAM.
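
The write-combining argument in point 2 can be illustrated with a toy model in Python. It is not a description of real PMEM internals: the only figure taken from this thread is the 256B combining granularity, while the 64B line size, the LRU policy, the round-robin arrival order and especially buffer_slots=8 are assumptions made up for the sketch.

```python
from collections import OrderedDict

LINE = 64     # bytes written per store (assumed CPU cache line size)
BLOCK = 256   # internal write-combining granularity mentioned above

def device_writes(num_writers, lines_per_writer, buffer_slots):
    """Count 256B block write-backs for round-robin interleaved writers.

    Each writer appends 64B lines sequentially to its own region. The
    modeled device keeps up to `buffer_slots` partially filled blocks in
    LRU order; a block is written back when complete or when evicted.
    """
    buf = OrderedDict()   # block id -> number of lines collected so far
    writes = 0
    for i in range(lines_per_writer):
        for w in range(num_writers):
            block = (w, (i * LINE) // BLOCK)
            if block in buf:
                buf.move_to_end(block)
            else:
                if len(buf) >= buffer_slots:
                    buf.popitem(last=False)   # evict a partially filled block
                    writes += 1
                buf[block] = 0
            buf[block] += 1
            if buf[block] * LINE == BLOCK:    # block complete: write it back
                del buf[block]
                writes += 1
    return writes + len(buf)                  # flush whatever is left

def amplification(num_writers, lines_per_writer=1024, buffer_slots=8):
    """Device block writes relative to the ideal of one write per full block."""
    ideal = num_writers * lines_per_writer * LINE // BLOCK
    return device_writes(num_writers, lines_per_writer, buffer_slots) / ideal

for n in (1, 8, 16):
    print(n, amplification(n))
```

In this model, up to 8 writers combine perfectly (amplification 1.0), while 16 interleaved writers evict each other's partial blocks on every store and end up with 4 times as many block writes. Real devices are more complicated, but the sketch shows why sequential streams from many processes can look random to the device.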

Makes sense.

There's a nice overview and measurements in this paper:

Building blocks for persistent memory / How to get the most out of your
new memory?
Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
Kemper

https://link.springer.com/article/10.1007/s00778-020-00622-9

Thank you. I'll read it.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

      The read-write asymmetry of PMem implies the necessity of avoiding
      writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, and not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit all of its abilities, we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers
with writes directly to PMEM is not economical, and aggregating data in
a DRAM buffer is better :-(

Yes. I think it might be interesting to analyze the bottlenecks of the
NTT patch with perf etc. If the bottlenecks move to other places after
removing WALWriteLock during flush, it's probably a good sign for
further performance improvements. IIRC WALWriteLock is one of the main
bottlenecks on OLTP workloads, although my memory might already be out
of date.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#49Takashi Menjo
takashi.menjo@gmail.com
In reply to: Masahiko Sawada (#48)
Re: [PoC] Non-volatile WAL buffer

Dear everyone,

I'm sorry for the late reply. I have rebased my two patchsets onto the
latest master 411ae64. The patchset prefixed with v4 is for the
non-volatile WAL buffer; the other, prefixed with v3, is for msync.

I will reply to your valuable feedback one by one within a few days.
Please bear with me.

Best regards,
Takashi

--
Takashi Menjo <takashi.menjo@gmail.com>

#50Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#49)
16 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Dear everyone,

Sorry but I forgot to attach my patchsets... Please see the files attached
to this mail. Please also note that they contain some fixes.

Best regards,
Takashi

2021年1月26日(火) 17:46 Takashi Menjo <takashi.menjo@gmail.com>:

Dear everyone,

I'm sorry for the late reply. I rebase my two patchsets onto the latest
master 411ae64.The one patchset prefixed with v4 is for non-volatile WAL
buffer; the other prefixed with v3 is for msync.

I will reply to your thankful feedbacks one by one within days. Please
wait for a moment.

Best regards,
Takashi

01/25/2021(Mon) 11:56 Masahiko Sawada <sawada.mshk@gmail.com>:

On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete

patch

was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

branch (clients)         1      16      32      64      96
----------------------------------------------------------------
master                7291   87704  165310  150437  224186
ntt                   7912  106095  213206  212410  237819
simple-no-buffers     7654   96544  115416   95828  103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what's the root cause, but I have a couple hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate this possibility.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(

It might also be interesting if we can see how much time is spent in
each logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().

Yeah, we could extend it to that, that's a fairly mechanical thing. But
maybe that could be visible in a regular perf profile. Also, I suppose
most of the time will be used by the pmem calls, shown in the stats.

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch (clients)         1      16      32      64      96
----------------------------------------------------------------
master                7291   87704  165310  150437  224186
ntt                   7912  106095  213206  212410  237819
simple-no-buffers     7654   96544  115416   95828  103065
with-wal-buffers      7477   95454  181702  140167  214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how could the NTT patch be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

While looking at the two methods, NTT and simple-no-buffer, I realized
that in XLogFlush(), the NTT patch flushes WAL (by pmem_flush() and
pmem_drain()) without acquiring WALWriteLock, whereas the
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
difference between those two methods, since WALWriteLock serializes
the operations. With PMEM, can multiple backends flush records
concurrently if their memory regions do not overlap? If so, flushing
WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

Thanks, that's good to know.

TBH I'm not convinced the "simple-no-buffer" code (coming from the 0002
patch) is actually correct. Essentially, consider the backend needs to
do a flush, but does not have a segment mapped. So it maps it and calls
pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

Yeah, in terms of experiments at least it's good to find out that the
approach of mmap-ing each WAL segment does not perform well.

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look like
this:

branch (clients)         1      16      32      64      96
----------------------------------------------------------------
master                6635   88524  171106  163387  245307
ntt                   7909  106826  217364  223338  242042
simple-no-buffers     7871  101575  199403  188074  224716
with-wal-buffers      7643  101056  206911  223860  261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both the "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get
even closer, if we supported them.

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments, to make this useful for common instances (not
everyone wants to run with 1GB WAL)? Or do we need to adopt the
design with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using File I/O on PMEM DAX filesystem seems good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I think the performance improvement by the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

Also, I'm interested in why the throughput of the NTT patch saturated at
32 clients, which is earlier than master's (96 clients). How
many CPU cores are there on the machine you used?

From what I know, this is somewhat expected for PMEM devices, for a
bunch of reasons:

1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
it takes fewer processes to saturate it.

2) Internally, the PMEM has a 256B buffer for writes, used for combining
etc. With too many processes sending writes, the pattern starts to look
more random, which is harmful for throughput.

When combined, this means the performance starts dropping at a certain
number of threads, and the optimal number of threads is rather low
(something like 5-10). This is very different behavior compared to DRAM.

Makes sense.

There's a nice overview and measurements in this paper:

Building blocks for persistent memory / How to get the most out of your
new memory?
Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
Kemper

https://link.springer.com/article/10.1007/s00778-020-00622-9

Thank you. I'll read it.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) write, not other abilities such as
fine-grained access and low-latency random write. If we want to
exploit all of its abilities, we might need some drastic changes to the
logging protocol, while also considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers
with writes directly to PMEM is not economical, and aggregating data in
a DRAM buffer is better :-(

Yes. I think it might be interesting to analyze the bottlenecks of the
NTT patch with perf etc. If the bottlenecks move to other places once
WALWriteLock is removed from the flush path, it's probably a good sign
for further performance improvements. IIRC WALWriteLock is one of the
main bottlenecks on OLTP workloads, although my memory might already be
out of date.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

--
Takashi Menjo <takashi.menjo@gmail.com>

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

v3-0007-Set-wal_buffers-to-the-same-pages-as-WAL-segment.patch (application/octet-stream)
From a14375f3735f79cb0d6b155a60a51aee3a8d9cd6 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 25 Mar 2020 10:20:16 +0900
Subject: [PATCH v3 07/10] Set wal_buffers to the same #pages as WAL segment

---
 src/backend/access/transam/xlog.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b7d99cacba..777a9e921c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -4914,10 +4914,10 @@ XLOGShmemSize(void)
 	{
 		char		buf[32];
 
-		snprintf(buf, sizeof(buf), "%d", XLOGChooseNumBuffers());
+		snprintf(buf, sizeof(buf), "%d", wal_segment_size / XLOG_BLCKSZ);
 		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
 	}
-	Assert(XLOGbuffers > 0);
+	Assert(XLOGbuffers == wal_segment_size / XLOG_BLCKSZ);
 
 	/* XLogCtl */
 	size = sizeof(XLogCtlData);
-- 
2.25.1

v4-0001-Support-GUCs-for-external-WAL-buffer.patch (application/octet-stream)
From 68154cbca8494274c54e2b1c607c2859afd7bd3b Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:56 +0900
Subject: [PATCH v4 1/6] Support GUCs for external WAL buffer

To implement non-volatile WAL buffer, we add two new GUCs nvwal_path
and nvwal_size.  Now postgres maps a file at that path onto memory to
use it as WAL buffer.  Note that the buffer is still volatile for now.
---
 configure                                     | 262 ++++++++++++++++++
 configure.ac                                  |  43 +++
 src/backend/access/transam/Makefile           |   3 +-
 src/backend/access/transam/nv_xlog_buffer.c   |  95 +++++++
 src/backend/access/transam/xlog.c             | 163 ++++++++++-
 src/backend/utils/misc/guc.c                  |  23 +-
 src/backend/utils/misc/postgresql.conf.sample |   2 +
 src/bin/initdb/initdb.c                       |  95 ++++++-
 src/include/access/nv_xlog_buffer.h           |  71 +++++
 src/include/access/xlog.h                     |   2 +
 src/include/pg_config.h.in                    |   6 +
 src/include/utils/guc.h                       |   4 +
 12 files changed, 747 insertions(+), 22 deletions(-)
 create mode 100644 src/backend/access/transam/nv_xlog_buffer.c
 create mode 100644 src/include/access/nv_xlog_buffer.h

diff --git a/configure b/configure
index 8af4b99021..76b5662262 100755
--- a/configure
+++ b/configure
@@ -866,6 +866,7 @@ with_libxml
 with_libxslt
 with_system_tzdata
 with_zlib
+with_nvwal
 with_gnu_ld
 enable_largefile
 '
@@ -1570,6 +1571,7 @@ Optional Packages:
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
+  --with-nvwal            use non-volatile WAL buffer (NVWAL)
   --with-gnu-ld           assume the C compiler uses GNU ld [default=no]
 
 Some influential environment variables:
@@ -8601,6 +8603,203 @@ fi
 
 
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking whether to build with non-volatile WAL buffer (NVWAL)" >&5
+$as_echo_n "checking whether to build with non-volatile WAL buffer (NVWAL)... " >&6; }
+
+
+
+# Check whether --with-nvwal was given.
+if test "${with_nvwal+set}" = set; then :
+  withval=$with_nvwal;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_NVWAL 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-nvwal option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_nvwal=no
+
+fi
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $with_nvwal" >&5
+$as_echo "$with_nvwal" >&6; }
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for grep that handles long lines and -e" >&5
+$as_echo_n "checking for grep that handles long lines and -e... " >&6; }
+if ${ac_cv_path_GREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if test -z "$GREP"; then
+  ac_path_GREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in grep ggrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_GREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_GREP" || continue
+# Check for GNU ac_path_GREP and select it if it is found.
+  # Check for GNU $ac_path_GREP
+case `"$ac_path_GREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_GREP="$ac_path_GREP" ac_path_GREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'GREP' >> "conftest.nl"
+    "$ac_path_GREP" -e 'GREP$' -e '-(cannot match)-' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_GREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_GREP="$ac_path_GREP"
+      ac_path_GREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_GREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_GREP"; then
+    as_fn_error $? "no acceptable grep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_GREP=$GREP
+fi
+
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_GREP" >&5
+$as_echo "$ac_cv_path_GREP" >&6; }
+ GREP="$ac_cv_path_GREP"
+
+
+{ $as_echo "$as_me:${as_lineno-$LINENO}: checking for egrep" >&5
+$as_echo_n "checking for egrep... " >&6; }
+if ${ac_cv_path_EGREP+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  if echo a | $GREP -E '(a|b)' >/dev/null 2>&1
+   then ac_cv_path_EGREP="$GREP -E"
+   else
+     if test -z "$EGREP"; then
+  ac_path_EGREP_found=false
+  # Loop through the user's path and test for each of PROGNAME-LIST
+  as_save_IFS=$IFS; IFS=$PATH_SEPARATOR
+for as_dir in $PATH$PATH_SEPARATOR/usr/xpg4/bin
+do
+  IFS=$as_save_IFS
+  test -z "$as_dir" && as_dir=.
+    for ac_prog in egrep; do
+    for ac_exec_ext in '' $ac_executable_extensions; do
+      ac_path_EGREP="$as_dir/$ac_prog$ac_exec_ext"
+      as_fn_executable_p "$ac_path_EGREP" || continue
+# Check for GNU ac_path_EGREP and select it if it is found.
+  # Check for GNU $ac_path_EGREP
+case `"$ac_path_EGREP" --version 2>&1` in
+*GNU*)
+  ac_cv_path_EGREP="$ac_path_EGREP" ac_path_EGREP_found=:;;
+*)
+  ac_count=0
+  $as_echo_n 0123456789 >"conftest.in"
+  while :
+  do
+    cat "conftest.in" "conftest.in" >"conftest.tmp"
+    mv "conftest.tmp" "conftest.in"
+    cp "conftest.in" "conftest.nl"
+    $as_echo 'EGREP' >> "conftest.nl"
+    "$ac_path_EGREP" 'EGREP$' < "conftest.nl" >"conftest.out" 2>/dev/null || break
+    diff "conftest.out" "conftest.nl" >/dev/null 2>&1 || break
+    as_fn_arith $ac_count + 1 && ac_count=$as_val
+    if test $ac_count -gt ${ac_path_EGREP_max-0}; then
+      # Best one so far, save it but keep looking for a better one
+      ac_cv_path_EGREP="$ac_path_EGREP"
+      ac_path_EGREP_max=$ac_count
+    fi
+    # 10*(2^10) chars as input seems more than enough
+    test $ac_count -gt 10 && break
+  done
+  rm -f conftest.in conftest.tmp conftest.nl conftest.out;;
+esac
+
+      $ac_path_EGREP_found && break 3
+    done
+  done
+  done
+IFS=$as_save_IFS
+  if test -z "$ac_cv_path_EGREP"; then
+    as_fn_error $? "no acceptable egrep could be found in $PATH$PATH_SEPARATOR/usr/xpg4/bin" "$LINENO" 5
+  fi
+else
+  ac_cv_path_EGREP=$EGREP
+fi
+
+   fi
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_path_EGREP" >&5
+$as_echo "$ac_cv_path_EGREP" >&6; }
+ EGREP="$ac_cv_path_EGREP"
+
+
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+#if __ELF__
+  yes
+#endif
+
+_ACEOF
+if (eval "$ac_cpp conftest.$ac_ext") 2>&5 |
+  $EGREP "yes" >/dev/null 2>&1; then :
+  ELF_SYS=true
+else
+  if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi
+fi
+rm -f conftest*
+
+
+
 #
 # Assignments
 #
@@ -12962,6 +13161,57 @@ fi
 fi
 
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'libpmem' is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+fi
+
 
 ##
 ## Header files
@@ -13641,6 +13891,18 @@ fi
 
 done
 
+fi
+
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$PORTNAME" = "win32" ; then
diff --git a/configure.ac b/configure.ac
index 868a94c9ba..fff1a078ab 100644
--- a/configure.ac
+++ b/configure.ac
@@ -999,6 +999,38 @@ PGAC_ARG_BOOL(with, zlib, yes,
               [do not use Zlib])
 AC_SUBST(with_zlib)
 
+#
+# Non-volatile WAL buffer (NVWAL)
+#
+AC_MSG_CHECKING([whether to build with non-volatile WAL buffer (NVWAL)])
+PGAC_ARG_BOOL(with, nvwal, no, [use non-volatile WAL buffer (NVWAL)],
+              [AC_DEFINE([USE_NVWAL], 1, [Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal)])])
+AC_MSG_RESULT([$with_nvwal])
+
+#
+# Elf
+#
+
+# Assume system is ELF if it predefines __ELF__ as 1,
+# otherwise believe host_os based default.
+case $host_os in
+    freebsd1*|freebsd2*) elf=no;;
+    freebsd3*|freebsd4*) elf=yes;;
+esac
+
+AC_EGREP_CPP(yes,
+[#if __ELF__
+  yes
+#endif
+],
+[ELF_SYS=true],
+[if test "X$elf" = "Xyes" ; then
+  ELF_SYS=true
+else
+  ELF_SYS=
+fi])
+AC_SUBST(ELF_SYS)
+
 #
 # Assignments
 #
@@ -1303,6 +1335,12 @@ elif test "$with_uuid" = ossp ; then
 fi
 AC_SUBST(UUID_LIBS)
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [],
+               [AC_MSG_ERROR([library 'libpmem' is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 
 ##
 ## Header files
@@ -1481,6 +1519,11 @@ elif test "$with_uuid" = ossp ; then
       [AC_MSG_ERROR([header file <ossp/uuid.h> or <uuid.h> is required for OSSP UUID])])])
 fi
 
+# for non-volatile WAL buffer (NVWAL)
+if test "$with_nvwal" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for non-volatile WAL buffer (NVWAL)])])
+fi
+
 if test "$PORTNAME" = "win32" ; then
    AC_CHECK_HEADERS(crtdefs.h)
 fi
diff --git a/src/backend/access/transam/Makefile b/src/backend/access/transam/Makefile
index 595e02de72..b41a710e7e 100644
--- a/src/backend/access/transam/Makefile
+++ b/src/backend/access/transam/Makefile
@@ -32,7 +32,8 @@ OBJS = \
 	xlogfuncs.o \
 	xloginsert.o \
 	xlogreader.o \
-	xlogutils.o
+	xlogutils.o \
+	nv_xlog_buffer.o
 
 include $(top_srcdir)/src/backend/common.mk
 
diff --git a/src/backend/access/transam/nv_xlog_buffer.c b/src/backend/access/transam/nv_xlog_buffer.c
new file mode 100644
index 0000000000..cfc6a6376b
--- /dev/null
+++ b/src/backend/access/transam/nv_xlog_buffer.c
@@ -0,0 +1,95 @@
+/*-------------------------------------------------------------------------
+ *
+ * nv_xlog_buffer.c
+ *		PostgreSQL non-volatile WAL buffer
+ *
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/transam/nv_xlog_buffer.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#ifdef USE_NVWAL
+
+#include <libpmem.h>
+#include "access/nv_xlog_buffer.h"
+
+#include "miscadmin.h" /* IsBootstrapProcessingMode */
+#include "common/file_perm.h" /* pg_file_create_mode */
+
+/*
+ * Maps non-volatile WAL buffer on shared memory.
+ *
+ * Returns a mapped address if success; PANICs and never return otherwise.
+ */
+void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	void	   *addr;
+	size_t		map_len = 0;
+	int			is_pmem = 0;
+
+	Assert(fname != NULL);
+	Assert(fsize > 0);
+
+	if (IsBootstrapProcessingMode())
+	{
+		/*
+		 * Create and map a new file if we are in bootstrap mode (typically
+		 * executed by initdb).
+		 */
+		addr = pmem_map_file(fname, fsize, PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+							 pg_file_create_mode, &map_len, &is_pmem);
+	}
+	else
+	{
+		/*
+		 * Map an existing file.  The second argument (len) should be zero,
+		 * the third argument (flags) should have neither PMEM_FILE_CREATE nor
+		 * PMEM_FILE_EXCL, and the fourth argument (mode) will be ignored.
+		 */
+		addr = pmem_map_file(fname, 0, 0, 0, &map_len, &is_pmem);
+	}
+
+	if (addr == NULL)
+		elog(PANIC, "could not map non-volatile WAL buffer '%s': %m", fname);
+
+	if (map_len != fsize)
+		elog(PANIC, "size of non-volatile WAL buffer '%s' is invalid; "
+					"expected %zu; actual %zu",
+			 fname, fsize, map_len);
+
+	if (!is_pmem)
+		elog(PANIC, "non-volatile WAL buffer '%s' is not on persistent memory",
+			 fname);
+
+	/*
+	 * Assert page boundary alignment (8KiB as default).  It should pass because
+	 * PMDK considers hugepage boundary alignment (2MiB or 1GiB on x64).
+	 */
+	Assert((uint64) addr % XLOG_BLCKSZ == 0);
+
+	elog(LOG, "non-volatile WAL buffer '%s' is mapped on [%p-%p)",
+		 fname, addr, (char *) addr + map_len);
+	return addr;
+}
+
+void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	Assert(addr != NULL);
+
+	if (pmem_unmap(addr, fsize) < 0)
+	{
+		elog(WARNING, "could not unmap non-volatile WAL buffer: %m");
+		return;
+	}
+
+	elog(LOG, "non-volatile WAL buffer unmapped");
+}
+
+#endif
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..8a125193aa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -37,6 +37,7 @@
 #include "access/xloginsert.h"
 #include "access/xlogreader.h"
 #include "access/xlogutils.h"
+#include "access/nv_xlog_buffer.h"
 #include "catalog/catversion.h"
 #include "catalog/pg_control.h"
 #include "catalog/pg_database.h"
@@ -885,6 +886,12 @@ static bool InRedo = false;
 /* Have we launched bgwriter during recovery? */
 static bool bgwriterLaunched = false;
 
+/* For non-volatile WAL buffer (NVWAL) */
+char	   *NvwalPath = NULL;	/* a GUC parameter */
+int			NvwalSizeMB = 1024;	/* a direct GUC parameter */
+static Size	NvwalSize = 0;		/* an indirect GUC parameter */
+static bool	NvwalAvail = false;
+
 /* For WALInsertLockAcquire/Release functions */
 static int	MyLockNo = 0;
 static bool holdingAllLocks = false;
@@ -5045,6 +5052,76 @@ check_wal_buffers(int *newval, void **extra, GucSource source)
 	return true;
 }
 
+/*
+ * GUC check_hook for nvwal_path.
+ */
+bool
+check_nvwal_path(char **newval, void **extra, GucSource source)
+{
+#ifndef USE_NVWAL
+	Assert(!NvwalAvail);
+
+	if (**newval != '\0')
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("nvwal_path is invalid parameter without NVWAL");
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_path(const char *newval, void *extra)
+{
+	/* true if not empty; false if empty */
+	NvwalAvail = (bool) (*newval != '\0');
+}
+
+/*
+ * GUC check_hook for nvwal_size.
+ *
+ * It checks the boundary only and DOES NOT check if the size is multiple
+ * of wal_segment_size because the segment size (probably stored in the
+ * control file) have not been set properly here yet.
+ *
+ * See XLOGShmemSize for more validation.
+ */
+bool
+check_nvwal_size(int *newval, void **extra, GucSource source)
+{
+#ifdef USE_NVWAL
+	Size		buf_size;
+	int64		npages;
+
+	Assert(*newval > 0);
+
+	buf_size = (Size) (*newval) * 1024 * 1024;
+	npages = (int64) buf_size / XLOG_BLCKSZ;
+	Assert(npages > 0);
+
+	if (npages > INT_MAX)
+	{
+		/* XLOG_BLCKSZ could be so small that npages exceeds INT_MAX */
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for nvwal_size (%dMB): "
+						 "the number of WAL pages too large; "
+						 "buf_size %zu; XLOG_BLCKSZ %d",
+						 *newval, buf_size, (int) XLOG_BLCKSZ);
+		return false;
+	}
+#endif
+
+	return true;
+}
+
+void
+assign_nvwal_size(int newval, void *extra)
+{
+	NvwalSize = (Size) newval * 1024 * 1024;
+}
+
 /*
  * Read the control file, set respective GUCs.
  *
@@ -5073,13 +5150,49 @@ XLOGShmemSize(void)
 {
 	Size		size;
 
+	/*
+	 * If we use non-volatile WAL buffer, we don't use the given wal_buffers.
+	 * Instead, we set it the value based on the size of the file for the
+	 * buffer. This should be done here because of xlblocks array calculation.
+	 */
+	if (NvwalAvail)
+	{
+		char		buf[32];
+		int64		npages;
+
+		Assert(NvwalSizeMB > 0);
+		Assert(NvwalSize > 0);
+		Assert(wal_segment_size > 0);
+		Assert(wal_segment_size % XLOG_BLCKSZ == 0);
+
+		/*
+		 * At last, we can check if the size of non-volatile WAL buffer
+		 * (nvwal_size) is multiple of WAL segment size.
+		 *
+		 * Note that NvwalSize has already been calculated in assign_nvwal_size.
+		 */
+		if (NvwalSize % wal_segment_size != 0)
+		{
+			elog(PANIC,
+				 "invalid value for nvwal_size (%dMB): "
+				 "it should be multiple of WAL segment size; "
+				 "NvwalSize %zu; wal_segment_size %d",
+				 NvwalSizeMB, NvwalSize, wal_segment_size);
+		}
+
+		npages = (int64) NvwalSize / XLOG_BLCKSZ;
+		Assert(npages > 0 && npages <= INT_MAX);
+
+		snprintf(buf, sizeof(buf), "%d", (int) npages);
+		SetConfigOption("wal_buffers", buf, PGC_POSTMASTER, PGC_S_OVERRIDE);
+	}
 	/*
 	 * If the value of wal_buffers is -1, use the preferred auto-tune value.
 	 * This isn't an amazingly clean place to do this, but we must wait till
 	 * NBuffers has received its final value, and must do it before using the
 	 * value of XLOGbuffers to do anything important.
 	 */
-	if (XLOGbuffers == -1)
+	else if (XLOGbuffers == -1)
 	{
 		char		buf[32];
 
@@ -5095,10 +5208,13 @@ XLOGShmemSize(void)
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
 	/* xlblocks array */
 	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	if (!NvwalAvail)
+	{
+		/* extra alignment padding for XLOG I/O buffers */
+		size = add_size(size, XLOG_BLCKSZ);
+		/* and the buffers themselves */
+		size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
+	}
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5192,13 +5308,32 @@ XLOGShmemInit(void)
 	}
 
 	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
+	 * Open and memory-map a file for non-volatile XLOG buffer. The PMDK will
+	 * align the start of the buffer to 2-MiB boundary if the size of the
+	 * buffer is larger than or equal to 4 MiB.
 	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	if (NvwalAvail)
+	{
+		/* Logging and error-handling should be done in the function */
+		XLogCtl->pages = MapNonVolatileXLogBuffer(NvwalPath, NvwalSize);
+
+		/*
+		 * Do not memset the non-volatile XLOG buffer (XLogCtl->pages) here
+		 * because it may contain records needed for recovery. It should be
+		 * cleared in a checkpoint after recovery completes successfully.
+		 */
+	}
+	else
+	{
+		/*
+		 * Align the start of the page buffers to a full xlog block size
+		 * boundary. This simplifies some calculations in XLOG insertion. It
+		 * is also required for O_DIRECT.
+		 */
+		allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
+		XLogCtl->pages = allocptr;
+		memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
+	}
 
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
@@ -8602,6 +8737,12 @@ ShutdownXLOG(int code, Datum arg)
 
 		CreateCheckPoint(CHECKPOINT_IS_SHUTDOWN | CHECKPOINT_IMMEDIATE);
 	}
+
+	/*
+	 * If we use non-volatile XLOG buffer, unmap it.
+	 */
+	if (NvwalAvail)
+		UnmapNonVolatileXLogBuffer(XLogCtl->pages, NvwalSize);
 }
 
 /*
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..6e9a45fba2 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2737,7 +2737,7 @@ static struct config_int ConfigureNamesInt[] =
 			GUC_UNIT_XBLOCKS
 		},
 		&XLOGbuffers,
-		-1, -1, (INT_MAX / XLOG_BLCKSZ),
+		-1, -1, INT_MAX,
 		check_wal_buffers, NULL, NULL
 	},
 
@@ -3445,6 +3445,17 @@ static struct config_int ConfigureNamesInt[] =
 		NULL, NULL, NULL
 	},
 
+	{
+		{"nvwal_size", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Size of non-volatile WAL buffer (NVWAL)."),
+			NULL,
+			GUC_UNIT_MB
+		},
+		&NvwalSizeMB,
+		1024, 1, INT_MAX,
+		check_nvwal_size, assign_nvwal_size, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, 0, 0, 0, NULL, NULL, NULL
@@ -4494,6 +4505,16 @@ static struct config_string ConfigureNamesString[] =
 		check_backtrace_functions, assign_backtrace_functions, NULL
 	},
 
+	{
+		{"nvwal_path", PGC_POSTMASTER, WAL_SETTINGS,
+			gettext_noop("Path to file for non-volatile WAL buffer (NVWAL)."),
+			NULL
+		},
+		&NvwalPath,
+		"",
+		check_nvwal_path, assign_nvwal_path, NULL
+	},
+
 	/* End-of-list marker */
 	{
 		{NULL, 0, 0, NULL, NULL}, NULL, NULL, NULL, NULL, NULL
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..4b6ced0852 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -230,6 +230,8 @@
 #checkpoint_timeout = 5min		# range 30s-1d
 #max_wal_size = 1GB
 #min_wal_size = 80MB
+#nvwal_path = '/path/to/nvwal'
+#nvwal_size = 1GB
 #checkpoint_completion_target = 0.5	# checkpoint target duration, 0.0 - 1.0
 #checkpoint_flush_after = 0		# measured in pages, 0 disables
 #checkpoint_warning = 30s		# 0 disables
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index e242a4a5b5..a2a87a8ec2 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -146,7 +146,10 @@ static bool show_setting = false;
 static bool data_checksums = false;
 static char *xlog_dir = NULL;
 static char *str_wal_segment_size_mb = NULL;
+static char *nvwal_path = NULL;
+static char *str_nvwal_size_mb = NULL;
 static int	wal_segment_size_mb;
+static int	nvwal_size_mb;
 
 
 /* internal vars */
@@ -1077,14 +1080,78 @@ setup_config(void)
 	conflines = replace_token(conflines, "#port = 5432", repltok);
 #endif
 
-	/* set default max_wal_size and min_wal_size */
-	snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
-	conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
-
-	snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
-			 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
-	conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	if (nvwal_path != NULL)
+	{
+		int nr_segs;
+
+		if (str_nvwal_size_mb == NULL)
+			nvwal_size_mb = 1024;
+		else
+		{
+			char *endptr;
+
+			/* check that the argument is a number */
+			nvwal_size_mb = strtol(str_nvwal_size_mb, &endptr, 10);
+
+			/* verify that the size of non-volatile WAL buffer is valid */
+			if (endptr == str_nvwal_size_mb || *endptr != '\0')
+			{
+				pg_log_error("argument of --nvwal-size must be a number; "
+							 "str_nvwal_size_mb '%s'",
+							 str_nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb <= 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a positive number; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb);
+				exit(1);
+			}
+			if (nvwal_size_mb % wal_segment_size_mb != 0)
+			{
+				pg_log_error("argument of --nvwal-size must be a multiple of WAL segment size; "
+							 "str_nvwal_size_mb '%s'; nvwal_size_mb %d; wal_segment_size_mb %d",
+							 str_nvwal_size_mb, nvwal_size_mb, wal_segment_size_mb);
+				exit(1);
+			}
+		}
+
+		/*
+		 * XXX We set {min_,max_,nv}wal_size to the same value.  Note that
+		 * postgres might bootstrap and run even if the three settings do not
+		 * have the same value, but that case has not been tested yet.
+		 */
+		nr_segs = nvwal_size_mb / wal_segment_size_mb;
+
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_path = '%s'",
+				 nvwal_path);
+		conflines = replace_token(conflines,
+								  "#nvwal_path = '/path/to/nvwal'", repltok);
+
+		snprintf(repltok, sizeof(repltok), "nvwal_size = %s",
+				 pretty_wal_size(nr_segs));
+		conflines = replace_token(conflines, "#nvwal_size = 1GB", repltok);
+	}
+	else
+	{
+		/* set default max_wal_size and min_wal_size */
+		snprintf(repltok, sizeof(repltok), "min_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MIN_WAL_SEGS));
+		conflines = replace_token(conflines, "#min_wal_size = 80MB", repltok);
+
+		snprintf(repltok, sizeof(repltok), "max_wal_size = %s",
+				 pretty_wal_size(DEFAULT_MAX_WAL_SEGS));
+		conflines = replace_token(conflines, "#max_wal_size = 1GB", repltok);
+	}
 
 	snprintf(repltok, sizeof(repltok), "lc_messages = '%s'",
 			 escape_quotes(lc_messages));
@@ -2290,6 +2357,8 @@ usage(const char *progname)
 	printf(_("  -W, --pwprompt            prompt for a password for the new superuser\n"));
 	printf(_("  -X, --waldir=WALDIR       location for the write-ahead log directory\n"));
 	printf(_("      --wal-segsize=SIZE    size of WAL segments, in megabytes\n"));
+	printf(_("  -P, --nvwal-path=FILE     path to file for non-volatile WAL buffer (NVWAL)\n"));
+	printf(_("  -Q, --nvwal-size=SIZE     size of NVWAL, in megabytes\n"));
 	printf(_("\nLess commonly used options:\n"));
 	printf(_("  -d, --debug               generate lots of debugging output\n"));
 	printf(_("  -L DIRECTORY              where to find the input files\n"));
@@ -2961,6 +3030,8 @@ main(int argc, char *argv[])
 		{"sync-only", no_argument, NULL, 'S'},
 		{"waldir", required_argument, NULL, 'X'},
 		{"wal-segsize", required_argument, NULL, 12},
+		{"nvwal-path", required_argument, NULL, 'P'},
+		{"nvwal-size", required_argument, NULL, 'Q'},
 		{"data-checksums", no_argument, NULL, 'k'},
 		{"allow-group-access", no_argument, NULL, 'g'},
 		{NULL, 0, NULL, 0}
@@ -3004,7 +3075,7 @@ main(int argc, char *argv[])
 
 	/* process command-line options */
 
-	while ((c = getopt_long(argc, argv, "A:dD:E:gkL:nNsST:U:WX:", long_options, &option_index)) != -1)
+	while ((c = getopt_long(argc, argv, "A:dD:E:gkL:nNP:Q:sST:U:WX:", long_options, &option_index)) != -1)
 	{
 		switch (c)
 		{
@@ -3101,6 +3172,12 @@ main(int argc, char *argv[])
 			case 13:
 				noinstructions = true;
 				break;
+			case 'P':
+				nvwal_path = pg_strdup(optarg);
+				break;
+			case 'Q':
+				str_nvwal_size_mb = pg_strdup(optarg);
+				break;
 			case 'g':
 				SetDataDirectoryCreatePerm(PG_DIR_MODE_GROUP);
 				break;
diff --git a/src/include/access/nv_xlog_buffer.h b/src/include/access/nv_xlog_buffer.h
new file mode 100644
index 0000000000..b58878c92b
--- /dev/null
+++ b/src/include/access/nv_xlog_buffer.h
@@ -0,0 +1,71 @@
+/*
+ * nv_xlog_buffer.h
+ *
+ * PostgreSQL non-volatile WAL buffer
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/nv_xlog_buffer.h
+ */
+#ifndef NV_XLOG_BUFFER_H
+#define NV_XLOG_BUFFER_H
+
+#ifdef USE_NVWAL
+#include <libpmem.h>
+
+extern void *MapNonVolatileXLogBuffer(const char *fname, Size fsize);
+extern void	UnmapNonVolatileXLogBuffer(void *addr, Size fsize);
+
+#define nv_memset_persist	pmem_memset_persist
+#define nv_memcpy_nodrain	pmem_memcpy_nodrain
+#define nv_flush			pmem_flush
+#define nv_drain			pmem_drain
+#define nv_persist			pmem_persist
+
+#else							/* !USE_NVWAL: no-op stubs, safe to include anywhere */
+static inline void *
+MapNonVolatileXLogBuffer(const char *fname, Size fsize)
+{
+	return NULL;
+}
+
+static inline void
+UnmapNonVolatileXLogBuffer(void *addr, Size fsize)
+{
+	return;
+}
+
+static inline void *
+nv_memset_persist(void *pmemdest, int c, size_t len)
+{
+	return NULL;
+}
+
+static inline void *
+nv_memcpy_nodrain(void *pmemdest, const void *src,
+				  size_t len)
+{
+	return NULL;
+}
+
+static inline void
+nv_flush(void *pmemdest, size_t len)
+{
+	return;
+}
+
+static inline void
+nv_drain(void)
+{
+	return;
+}
+
+static inline void
+nv_persist(const void *addr, size_t len)
+{
+	return;
+}
+
+#endif							/* USE_NVWAL */
+#endif							/* NV_XLOG_BUFFER_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..1ad6132f67 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -131,6 +131,8 @@ extern int	recovery_min_apply_delay;
 extern char *PrimaryConnInfo;
 extern char *PrimarySlotName;
 extern bool wal_receiver_create_temp_slot;
+extern char *NvwalPath;
+extern int  NvwalSizeMB;
 
 /* indirectly set via GUC system */
 extern TransactionId recoveryTargetXid;
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f4d9f3b408..6fd2e40d74 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -325,6 +325,9 @@
 /* Define to 1 if you have the `pam' library (-lpam). */
 #undef HAVE_LIBPAM
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define if you have a function readline library */
 #undef HAVE_LIBREADLINE
 
@@ -899,6 +902,9 @@
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
+/* Define to 1 to use non-volatile WAL buffer (NVWAL). (--with-nvwal) */
+#undef USE_NVWAL
+
 /* Define to build with OpenSSL support. (--with-openssl) */
 #undef USE_OPENSSL
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 5004ee4177..e6394cd16f 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,6 +440,10 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_nvwal_path(char **newval, void **extra, GucSource source);
+extern void assign_nvwal_path(const char *newval, void *extra);
+extern bool check_nvwal_size(int *newval, void **extra, GucSource source);
+extern void assign_nvwal_size(int newval, void *extra);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.25.1
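
The --nvwal-size checks that the initdb part of the patch above performs in
setup_config() can be sketched as a standalone function. This is only an
illustration under assumptions, not code from the patch: the function name
nvwal_size_to_segments is hypothetical, and the range check against INT_MAX
is an extra step that the patch does not do.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the patch): validate an --nvwal-size
 * argument given in megabytes and convert it to a number of WAL segments.
 * Returns true and sets *nr_segs on success, false otherwise.
 */
bool
nvwal_size_to_segments(const char *arg, int wal_segment_size_mb, int *nr_segs)
{
	char	   *endptr;
	long		size_mb;

	assert(arg != NULL && nr_segs != NULL);

	errno = 0;
	size_mb = strtol(arg, &endptr, 10);

	/* must be a number with nothing after it */
	if (endptr == arg || *endptr != '\0')
		return false;

	/* assumption: also reject values that do not fit in an int */
	if (errno == ERANGE || size_mb > INT_MAX)
		return false;

	/* must be positive */
	if (size_mb <= 0)
		return false;

	/* must be a multiple of the WAL segment size */
	if (wal_segment_size_mb <= 0 || size_mb % wal_segment_size_mb != 0)
		return false;

	*nr_segs = (int) (size_mb / wal_segment_size_mb);
	return true;
}
```

With the README's example, "81920" and the default 16MB segment size give
5120 segments, which is the count the patch passes to pretty_wal_size() for
min_wal_size, max_wal_size, and nvwal_size.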

v4-0005-README-for-non-volatile-WAL-buffer.patch (application/octet-stream)
From 16f5e3e99e501abb6df0e1b58975e45689aa04f7 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:08:00 +0900
Subject: [PATCH v4 5/6] README for non-volatile WAL buffer

---
 README.nvwal | 184 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 184 insertions(+)
 create mode 100644 README.nvwal

diff --git a/README.nvwal b/README.nvwal
new file mode 100644
index 0000000000..b6b9d576e7
--- /dev/null
+++ b/README.nvwal
@@ -0,0 +1,184 @@
+Non-volatile WAL buffer
+=======================
+This is a PostgreSQL branch with a proof-of-concept "non-volatile WAL buffer"
+(NVWAL) feature. By placing the WAL buffer pages on persistent memory (PMEM)
+[1], inserting WAL records into them directly, and eliminating I/O for WAL
+segment files, PostgreSQL gets lower latency and higher throughput.
+
+
+Prerequisites and recommendations
+---------------------------------
+* An x64 system
+  * (Recommended) Support for the CLFLUSHOPT or CLWB instruction
+    * Check whether lscpu shows the "clflushopt" or "clwb" flag
+* An OS supporting PMEM
+  * Linux: 4.15 or later (tested on 5.2)
+  * Windows: not tested yet
+* A filesystem supporting DAX (tested on ext4)
+* libpmem in PMDK [2] 1.4 or later (tested on 1.7)
+* ndctl [3] (tested on 61.2)
+* ipmctl [4] if you use Intel DCPMM
+* sudo privilege
+* All other prerequisites of original PostgreSQL
+* (Recommended) PMEM module(s) (NVDIMM-N or Intel DCPMM)
+  * You can emulate PMEM using DRAM [5] even if you have no PMEM module.
+* (Recommended) numactl
+
+
+Build and install PostgreSQL with NVWAL feature
+-----------------------------------------------
+The patch adds a new configure option, --with-nvwal.
+
+Installing under your home directory with the --prefix option is convenient.
+If you do so, do not forget to export PATH.
+
+  $ ./configure --with-nvwal --prefix="$HOME/postgres"
+  $ make
+  $ make install
+  $ export PATH="$HOME/postgres/bin:$PATH"
+
+NOTE: ./configure --with-nvwal will fail if libpmem is not found.
+
+
+Prepare DAX filesystem
+----------------------
+Here we use NVDIMM-N or emulated PMEM, make an ext4 filesystem on namespace0.0
+(/dev/pmem0), and mount it on /mnt/pmem0. Do not forget the "-o dax" option
+when mounting. For Intel DCPMM and ipmctl, please see [4].
+
+  $ ndctl list
+  [
+    {
+      "dev":"namespace1.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem1",
+      "numa_node":1
+    },
+    {
+      "dev":"namespace0.0",
+      "mode":"raw",
+      "size":103079215104,
+      "sector_size":512,
+      "blockdev":"pmem0",
+      "numa_node":0
+    }
+  ]
+
+  $ sudo ndctl create-namespace -f -t pmem -m fsdax -M dev -e namespace0.0
+  {
+    "dev":"namespace0.0",
+    "mode":"fsdax",
+    "map":"dev",
+    "size":"94.50 GiB (101.47 GB)",
+    "uuid":"e7da9d65-140b-4e1e-90ec-6548023a1b6e",
+    "sector_size":512,
+    "blockdev":"pmem0",
+    "numa_node":0
+  }
+
+  $ ls -l /dev/pmem0
+  brw-rw---- 1 root disk 259, 3 Jan  6 17:06 /dev/pmem0
+
+  $ sudo mkfs.ext4 -q -F /dev/pmem0
+  $ sudo mkdir -p /mnt/pmem0
+  $ sudo mount -o dax /dev/pmem0 /mnt/pmem0
+  $ mount -l | grep ^/dev/pmem0
+  /dev/pmem0 on /mnt/pmem0 type ext4 (rw,relatime,dax)
+
+
+Enable transparent huge pages
+-----------------------------
+Transparent huge pages are usually considered unsuitable for database
+workloads, but they improve PMEM performance by reducing page-walk overhead.
+
+  $ ls -l /sys/kernel/mm/transparent_hugepage/enabled
+  -rw-r--r-- 1 root root 4096 Dec  3 10:38 /sys/kernel/mm/transparent_hugepage/enabled
+
+  $ echo always | sudo dd of=/sys/kernel/mm/transparent_hugepage/enabled 2>/dev/null
+  $ cat /sys/kernel/mm/transparent_hugepage/enabled
+  [always] madvise never
+
+
+initdb
+------
+We have two new options:
+
+  -P, --nvwal-path=FILE  path to file for non-volatile WAL buffer (NVWAL)
+  -Q, --nvwal-size=SIZE  size of NVWAL, in megabytes
+
+To create a new 80GB (81920MB) NVWAL file at /mnt/pmem0/pgsql/nvwal, run
+initdb as follows:
+
+  $ sudo mkdir -p /mnt/pmem0/pgsql
+  $ sudo chown "$USER:$USER" /mnt/pmem0/pgsql
+  $ export PGDATA="$HOME/pgdata"
+  $ initdb -P /mnt/pmem0/pgsql/nvwal -Q 81920
+
+You will find that no WAL segment file is created in the PGDATA/pg_wal
+directory. That is expected; the NVWAL file holds the content of the first
+WAL segment file.
+
+NOTE:
+* initdb will fail if the given NVWAL size is not a multiple of the WAL
+  segment size. The segment size is set with initdb --wal-segsize and
+  defaults to 16MB.
+* postgres (executed by initdb) will fail in bootstrap if the directory in
+  which the NVWAL file is to be created (/mnt/pmem0/pgsql in the example
+  above) does not exist.
+* postgres (executed by initdb) will fail in bootstrap if an entry already
+  exists at the given path.
+* postgres (executed by initdb) will fail in bootstrap if the given path is
+  not on PMEM or the filesystem was mounted without the "-o dax" option.
+* Resizing an NVWAL file is NOT supported yet. Choose the size of your NVWAL
+  file carefully.
+* "-Q 1024" (1024MB) is assumed if -P is given but -Q is not.
+
+
+postgresql.conf
+---------------
+There are two new parameters, nvwal_path and nvwal_size, corresponding to the
+two new initdb options. If you run initdb as above, postgresql.conf in your
+PGDATA directory will contain the following settings:
+
+  max_wal_size = 80GB
+  min_wal_size = 80GB
+  nvwal_path = '/mnt/pmem0/pgsql/nvwal'
+  nvwal_size = 80GB
+
+NOTE:
+* postgres will fail at startup if no file exists at the given nvwal_path.
+* postgres will fail at startup if the given nvwal_size is not equal to the
+  actual NVWAL file size.
+* postgres will fail at startup if the given nvwal_path is not on PMEM or the
+  filesystem was mounted without the "-o dax" option.
+* wal_buffers will be ignored if nvwal_path is given.
+* You SHOULD give both max_wal_size and min_wal_size the same value as
+  nvwal_size. postgres might run even if the three values differ, but that
+  case has not been tested yet.
+
+
+Startup
+-------
+Start the server as usual:
+
+  $ pg_ctl start
+
+or use numactl as follows to run postgres on a specific NUMA node (typically
+the one on which your NVWAL file resides) if you need stable performance:
+
+  $ numactl --cpunodebind=0 --membind=0 -- pg_ctl start
+
+
+References
+----------
+[1] https://pmem.io/
+[2] https://pmem.io/pmdk/
+[3] https://docs.pmem.io/ndctl-user-guide/
+[4] https://docs.pmem.io/ipmctl-user-guide/
+[5] https://software.intel.com/en-us/articles/how-to-emulate-persistent-memory-on-an-intel-architecture-server
+
+
+--
+Takashi Menjo <takashi.menjou.vg AT hco.ntt.co.jp>
-- 
2.25.1
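
The postgresql.conf section of README.nvwal above shows the four settings
that initdb writes when -P is given. As a rough sketch of how those lines are
derived from one size value, here is a small standalone formatter. It is not
code from the patch: both function names are hypothetical, the MB/GB choice
only imitates what the README output suggests, and refusing paths that
contain a single quote is an added assumption (the patch writes the path as
given).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical: print a size in GB when it is a whole number of GB. */
int
format_size_mb(char *buf, size_t buflen, int size_mb)
{
	if (size_mb % 1024 == 0)
		return snprintf(buf, buflen, "%dGB", size_mb / 1024);
	return snprintf(buf, buflen, "%dMB", size_mb);
}

/*
 * Hypothetical: build the four NVWAL-related postgresql.conf lines.
 * min_wal_size, max_wal_size, and nvwal_size all get the same value,
 * as the README recommends. Returns false on bad input or truncation.
 */
bool
format_nvwal_settings(char *buf, size_t buflen,
					  const char *nvwal_path, int nvwal_size_mb)
{
	char		size[32];
	int			n;

	assert(buf != NULL && nvwal_path != NULL);

	if (nvwal_size_mb <= 0 || nvwal_path[0] == '\0')
		return false;

	/* assumption: refuse paths that would need quote escaping */
	if (strchr(nvwal_path, '\'') != NULL)
		return false;

	format_size_mb(size, sizeof(size), nvwal_size_mb);

	n = snprintf(buf, buflen,
				 "max_wal_size = %s\n"
				 "min_wal_size = %s\n"
				 "nvwal_path = '%s'\n"
				 "nvwal_size = %s\n",
				 size, size, nvwal_path, size);

	return n >= 0 && (size_t) n < buflen;
}
```

Calling format_nvwal_settings() with "/mnt/pmem0/pgsql/nvwal" and 81920
reproduces the four lines listed in the README.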

v4-0003-walreceiver-supports-non-volatile-WAL-buffer.patch (application/octet-stream)
From 175ebd6fb15d172cd3029148db4f82a39225b6de Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:58 +0900
Subject: [PATCH v4 3/6] walreceiver supports non-volatile WAL buffer

Now walreceiver stores received records directly in the non-volatile
WAL buffer if it is available.
---
 src/backend/access/transam/xlog.c     | 31 +++++++++++++++-
 src/backend/replication/walreceiver.c | 53 ++++++++++++++++++++++++++-
 src/include/access/xlog.h             |  4 ++
 3 files changed, 85 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a74ed6f6c6..a3caf85f1f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -937,6 +937,8 @@ static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+static bool CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr,
+								   bool store);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -12827,6 +12829,21 @@ GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
  */
 bool
 CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	return CopyXLogRecordsOnNVWAL(buf, count, startptr, false);
+}
+
+/*
+ * Called by walreceiver.
+ */
+bool
+CopyXLogRecordsToNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	return CopyXLogRecordsOnNVWAL(buf, count, startptr, true);
+}
+
+static bool
+CopyXLogRecordsOnNVWAL(char *buf, Size count, XLogRecPtr startptr, bool store)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -12876,7 +12893,13 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
 		max_copy = NvwalSize - off;
 		copybytes = Min(nbytes, max_copy);
 
-		memcpy(p, q, copybytes);
+		if (store)
+		{
+			memcpy(q, p, copybytes);
+			nv_flush(q, copybytes);
+		}
+		else
+			memcpy(p, q, copybytes);
 
 		/* Update state for copy */
 		recptr += copybytes;
@@ -12888,6 +12911,12 @@ CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
 	return true;
 }
 
+void
+SyncNVWAL(void)
+{
+	nv_drain();
+}
+
 static bool
 IsXLogSourceFromStream(XLogSource source)
 {
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 723f513d8b..d799ef81c4 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -123,6 +123,7 @@ static void WalRcvWaitForStartPosition(XLogRecPtr *startpoint, TimeLineID *start
 static void WalRcvDie(int code, Datum arg);
 static void XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len);
 static void XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr);
+static void XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr);
 static void XLogWalRcvFlush(bool dying);
 static void XLogWalRcvSendReply(bool force, bool requestReply);
 static void XLogWalRcvSendHSFeedback(bool immed);
@@ -829,7 +830,10 @@ XLogWalRcvProcessMsg(unsigned char type, char *buf, Size len)
 
 				buf += hdrlen;
 				len -= hdrlen;
-				XLogWalRcvWrite(buf, len, dataStart);
+				if (IsNvwalAvail())
+					XLogWalRcvStore(buf, len, dataStart);
+				else
+					XLogWalRcvWrite(buf, len, dataStart);
 				break;
 			}
 		case 'k':				/* Keepalive */
@@ -964,6 +968,42 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
 }
 
+/*
+ * Like XLogWalRcvWrite, but store to non-volatile WAL buffer.
+ */
+static void
+XLogWalRcvStore(char *buf, Size nbytes, XLogRecPtr recptr)
+{
+	Assert(IsNvwalAvail());
+
+	CopyXLogRecordsToNVWAL(buf, nbytes, recptr);
+
+	/*
+	 * Also write out to file if we have to archive segments.
+	 *
+	 * We could do this segment by segment, but we reuse the existing
+	 * record-by-record method because the former would add complexity
+	 * (locking WalBufMappingLock, getting the address of the segment in
+	 * the non-volatile WAL buffer, etc).
+	 */
+	if (XLogArchiveMode == ARCHIVE_MODE_ALWAYS)
+		XLogWalRcvWrite(buf, nbytes, recptr);
+	else
+	{
+		/*
+		 * Update status as XLogWalRcvWrite does.
+		 */
+
+		/* Update process-local status */
+		XLByteToSeg(recptr + nbytes, recvSegNo, wal_segment_size);
+		recvFileTLI = ThisTimeLineID;
+		LogstreamResult.Write = recptr + nbytes;
+
+		/* Update shared-memory status */
+		pg_atomic_write_u64(&WalRcv->writtenUpto, LogstreamResult.Write);
+	}
+}
+
 /*
  * Flush the log to disk.
  *
@@ -977,7 +1017,16 @@ XLogWalRcvFlush(bool dying)
 	{
 		WalRcvData *walrcv = WalRcv;
 
-		issue_xlog_fsync(recvFile, recvSegNo);
+		/*
+		 * We should call both SyncNVWAL and issue_xlog_fsync if we use NVWAL
+		 * and WAL archive.  So we have the following two if-statements, not
+		 * one if-else-statement.
+		 */
+		if (IsNvwalAvail())
+			SyncNVWAL();
+
+		if (recvFile >= 0)
+			issue_xlog_fsync(recvFile, recvSegNo);
 
 		LogstreamResult.Flush = LogstreamResult.Write;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 53f9fef527..2f02b5a45f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -361,6 +361,10 @@ extern XLogRecPtr GetLoadableSizeFromNvwal(XLogRecPtr target,
 extern bool CopyXLogRecordsFromNVWAL(char *buf,
 									 Size count,
 									 XLogRecPtr startptr);
+extern bool CopyXLogRecordsToNVWAL(char *buf,
+								   Size count,
+								   XLogRecPtr startptr);
+extern void SyncNVWAL(void);
 
 /*
  * Routines to start, stop, and get status of a base backup.
-- 
2.25.1
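
The copy loop that the walreceiver patch above extends treats the NVWAL as a
ring: the offset is the WAL position modulo the buffer size, and a copy that
reaches the end of the buffer continues at offset 0. Below is a minimal
standalone sketch of that idea. It only mirrors the part of
CopyXLogRecordsOnNVWAL visible in the hunk; the names ring_copy and
ring_flush are hypothetical, ring_flush is a no-op stand-in for nv_flush(),
and the up-front size check is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for nv_flush(); with libpmem this would be pmem_flush(). */
static void
ring_flush(const void *addr, size_t len)
{
	(void) addr;
	(void) len;
}

/*
 * Copy count bytes between buf and a ring buffer, starting at logical
 * position startptr. If store is true, copy buf -> ring and flush each
 * chunk; otherwise copy ring -> buf.
 */
bool
ring_copy(char *ring, size_t ring_size, char *buf, size_t count,
		  uint64_t startptr, bool store)
{
	char	   *p = buf;
	uint64_t	recptr = startptr;
	size_t		nbytes = count;

	assert(ring != NULL && buf != NULL);

	/* assumption: refuse copies that would overwrite themselves */
	if (ring_size == 0 || count > ring_size)
		return false;

	while (nbytes > 0)
	{
		size_t		off = (size_t) (recptr % ring_size);
		size_t		max_copy = ring_size - off; /* bytes until wraparound */
		size_t		copybytes = nbytes < max_copy ? nbytes : max_copy;
		char	   *q = ring + off;

		if (store)
		{
			memcpy(q, p, copybytes);
			ring_flush(q, copybytes);
		}
		else
			memcpy(p, q, copybytes);

		/* advance to the next chunk */
		recptr += copybytes;
		p += copybytes;
		nbytes -= copybytes;
	}

	return true;
}
```

In the patch the flush per chunk is followed later by a single nv_drain() in
SyncNVWAL(), called from XLogWalRcvFlush; the sketch leaves that step out.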

v4-0004-pg_basebackup-supports-non-volatile-WAL-buffer.patch (application/octet-stream)
From 83a63da1220ba85aacbed8c124b532b04edff803 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:59 +0900
Subject: [PATCH v4 4/6] pg_basebackup supports non-volatile WAL buffer

Now pg_basebackup copies received WAL segments onto the non-volatile
WAL buffer if you run it in "nvwal" mode (-Fn).

You should specify a new NVWAL path with the --nvwal-path option.
The path will be written to postgresql.auto.conf or recovery.conf.
The size of the new NVWAL is the same as the master's.
---
 src/bin/pg_basebackup/pg_basebackup.c | 335 +++++++++++++++++++++++++-
 src/bin/pg_basebackup/streamutil.c    |  69 ++++++
 src/bin/pg_basebackup/streamutil.h    |   3 +
 src/bin/pg_rewind/pg_rewind.c         |   4 +-
 src/fe_utils/recovery_gen.c           |   9 +-
 src/include/fe_utils/recovery_gen.h   |   3 +-
 6 files changed, 407 insertions(+), 16 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_basebackup.c b/src/bin/pg_basebackup/pg_basebackup.c
index 16d8929b23..134c4a67b8 100644
--- a/src/bin/pg_basebackup/pg_basebackup.c
+++ b/src/bin/pg_basebackup/pg_basebackup.c
@@ -25,6 +25,9 @@
 #ifdef HAVE_LIBZ
 #include <zlib.h>
 #endif
+#ifdef USE_NVWAL
+#include <libpmem.h>
+#endif
 
 #include "access/xlog_internal.h"
 #include "common/file_perm.h"
@@ -127,7 +130,8 @@ typedef enum
 static char *basedir = NULL;
 static TablespaceList tablespace_dirs = {NULL, NULL};
 static char *xlog_dir = NULL;
-static char format = 'p';		/* p(lain)/t(ar) */
+static char format = 'p';			/* p(lain)/t(ar); 'p' even if 'nvwal' given */
+static bool format_nvwal = false;	/* true if 'nvwal' given */
 static char *label = "pg_basebackup base backup";
 static bool noclean = false;
 static bool checksum_failure = false;
@@ -150,14 +154,24 @@ static bool verify_checksums = true;
 static bool manifest = true;
 static bool manifest_force_encode = false;
 static char *manifest_checksums = NULL;
+static char *nvwal_path = NULL;
+#ifdef USE_NVWAL
+static size_t nvwal_size = 0;
+static char *nvwal_pages = NULL;
+static size_t nvwal_mapped_len = 0;
+#endif
 
 static bool success = false;
+static bool xlogdir_is_pg_xlog = false;
 static bool made_new_pgdata = false;
 static bool found_existing_pgdata = false;
 static bool made_new_xlogdir = false;
 static bool found_existing_xlogdir = false;
 static bool made_tablespace_dirs = false;
 static bool found_tablespace_dirs = false;
+#ifdef USE_NVWAL
+static bool made_new_nvwal = false;
+#endif
 
 /* Progress counters */
 static uint64 totalsize_kb;
@@ -382,7 +396,7 @@ usage(void)
 	printf(_("  %s [OPTION]...\n"), progname);
 	printf(_("\nOptions controlling the output:\n"));
 	printf(_("  -D, --pgdata=DIRECTORY receive base backup into directory\n"));
-	printf(_("  -F, --format=p|t       output format (plain (default), tar)\n"));
+	printf(_("  -F, --format=p|t|n     output format (plain (default), tar, nvwal)\n"));
 	printf(_("  -r, --max-rate=RATE    maximum transfer rate to transfer data directory\n"
 			 "                         (in kB/s, or use suffix \"k\" or \"M\")\n"));
 	printf(_("  -R, --write-recovery-conf\n"
@@ -390,6 +404,7 @@ usage(void)
 	printf(_("  -T, --tablespace-mapping=OLDDIR=NEWDIR\n"
 			 "                         relocate tablespace in OLDDIR to NEWDIR\n"));
 	printf(_("      --waldir=WALDIR    location for the write-ahead log directory\n"));
+	printf(_("      --nvwal-path=NVWAL location for the NVWAL file\n"));
 	printf(_("  -X, --wal-method=none|fetch|stream\n"
 			 "                         include required WAL files with specified method\n"));
 	printf(_("  -z, --gzip             compress tar output\n"));
@@ -630,9 +645,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 
 	/* In post-10 cluster, pg_xlog has been renamed to pg_wal */
 	snprintf(param->xlog, sizeof(param->xlog), "%s/%s",
-			 basedir,
-			 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-			 "pg_xlog" : "pg_wal");
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 	/* Temporary replication slots are only supported in 10 and newer */
 	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_TEMP_SLOTS)
@@ -669,9 +682,7 @@ StartLogStreamer(char *startpos, uint32 timeline, char *sysidentifier)
 		 * tar file may arrive later.
 		 */
 		snprintf(statusdir, sizeof(statusdir), "%s/%s/archive_status",
-				 basedir,
-				 PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-				 "pg_xlog" : "pg_wal");
+				 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 		if (pg_mkdir_p(statusdir, pg_dir_create_mode) != 0 && errno != EEXIST)
 		{
@@ -1793,6 +1804,135 @@ ReceiveBackupManifestInMemoryChunk(size_t r, char *copybuf,
 	appendPQExpBuffer(buf, copybuf, r);
 }
 
+#ifdef USE_NVWAL
+static void
+cleanup_nvwal_atexit(void)
+{
+	if (success || in_log_streamer)
+		return;
+
+	if (nvwal_pages != NULL)
+	{
+		pg_log_info("unmapping NVWAL");
+		if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+		{
+			pg_log_error("could not unmap NVWAL: %m");
+			return;
+		}
+	}
+
+	if (nvwal_path != NULL && made_new_nvwal)
+	{
+		pg_log_info("removing NVWAL file \"%s\"", nvwal_path);
+		if (unlink(nvwal_path) < 0)
+		{
+			pg_log_error("could not remove NVWAL file \"%s\": %m", nvwal_path);
+			return;
+		}
+	}
+}
+
+static int
+filter_walseg(const struct dirent *d)
+{
+	char			fullpath[MAXPGPATH];
+	struct stat		statbuf;
+
+	if (!IsXLogFileName(d->d_name))
+		return 0;
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", d->d_name);
+
+	if (stat(fullpath, &statbuf) < 0)
+		return 0;
+
+	if (!S_ISREG(statbuf.st_mode))
+		return 0;
+
+	if (statbuf.st_size != WalSegSz)
+		return 0;
+
+	return 1;
+}
+
+static int
+compare_walseg(const struct dirent **a, const struct dirent **b)
+{
+	return strcmp((*a)->d_name, (*b)->d_name);
+}
+
+static void
+free_namelist(struct dirent **namelist, int nr)
+{
+	for (int i = 0; i < nr; ++i)
+		free(namelist[i]);
+
+	free(namelist);
+}
+
+static bool
+copy_walseg_onto_nvwal(const char *segname)
+{
+	char			fullpath[MAXPGPATH];
+	int				fd;
+	size_t			off;
+	struct stat		statbuf;
+	TimeLineID		tli;
+	XLogSegNo		segno;
+
+	snprintf(fullpath, sizeof(fullpath), "%s/%s/%s",
+			 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal", segname);
+
+	fd = open(fullpath, O_RDONLY);
+	if (fd < 0)
+	{
+		pg_log_error("could not open xlog segment \"%s\": %m", fullpath);
+		return false;
+	}
+
+	if (fstat(fd, &statbuf) < 0)
+	{
+		pg_log_error("could not fstat xlog segment \"%s\": %m", fullpath);
+		goto close_on_error;
+	}
+
+	if (!S_ISREG(statbuf.st_mode))
+	{
+		pg_log_error("xlog segment \"%s\" is not a regular file", fullpath);
+		goto close_on_error;
+	}
+
+	if (statbuf.st_size != WalSegSz)
+	{
+		pg_log_error("invalid size of xlog segment \"%s\"; expected %d, actual %zd",
+					 fullpath, WalSegSz, (ssize_t) statbuf.st_size);
+		goto close_on_error;
+	}
+
+	XLogFromFileName(segname, &tli, &segno, WalSegSz);
+	off = ((size_t) segno * WalSegSz) % nvwal_size;
+
+	if (read(fd, &nvwal_pages[off], WalSegSz) < WalSegSz)
+	{
+		pg_log_error("could not fully read xlog segment \"%s\": %m", fullpath);
+		goto close_on_error;
+	}
+
+	if (close(fd) < 0)
+	{
+		pg_log_error("could not close xlog segment \"%s\": %m", fullpath);
+		return false;
+	}
+
+	return true;
+
+close_on_error:
+	(void) close(fd);
+	return false;
+}
+#endif
+
 static void
 BaseBackup(void)
 {
@@ -1851,7 +1991,8 @@ BaseBackup(void)
 	 * Build contents of configuration file if requested
 	 */
 	if (writerecoveryconf)
-		recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot);
+		recoveryconfcontents = GenerateRecoveryConfig(conn, replication_slot,
+													  nvwal_path);
 
 	/*
 	 * Run IDENTIFY_SYSTEM so we can get the timeline
@@ -2216,6 +2357,69 @@ BaseBackup(void)
 			exit(1);
 	}
 
+#ifdef USE_NVWAL
+	/* Copy xlog segments into NVWAL when in nvwal mode */
+	if (format_nvwal)
+	{
+		char	xldr_path[MAXPGPATH];
+		int		nr_segs;
+		struct dirent **namelist;
+
+		/* clear NVWAL before copying xlog segments */
+		pmem_memset_persist(nvwal_pages, 0, nvwal_size);
+
+		snprintf(xldr_path, sizeof(xldr_path), "%s/%s",
+				 basedir, xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
+
+		/*
+		 * Sort xlog segments in ascending order, filtering out non-segment
+		 * files and directories.
+		 */
+		nr_segs = scandir(xldr_path, &namelist, filter_walseg, compare_walseg);
+		if (nr_segs < 0)
+		{
+			pg_log_error("could not scan xlog directory \"%s\": %m", xldr_path);
+			exit(1);
+		}
+
+		/* Copy xlog segments onto NVWAL */
+		for (int i = 0; i < nr_segs; ++i)
+		{
+			if (!copy_walseg_onto_nvwal(namelist[i]->d_name))
+			{
+				free_namelist(namelist, nr_segs);
+				exit(1);
+			}
+		}
+
+		/* Copy complete; now remove all the xlog segments */
+		for (int i = 0; i < nr_segs; ++i)
+		{
+			char		fullpath[MAXPGPATH];
+
+			snprintf(fullpath, sizeof(fullpath), "%s/%s",
+					 xldr_path, namelist[i]->d_name);
+
+			if (unlink(fullpath) < 0)
+			{
+				pg_log_error("could not remove xlog segment \"%s\": %m", fullpath);
+				free_namelist(namelist, nr_segs);
+				exit(1);
+			}
+		}
+
+		free_namelist(namelist, nr_segs);
+
+		if (pmem_unmap(nvwal_pages, nvwal_mapped_len) < 0)
+		{
+			pg_log_error("could not unmap NVWAL: %m");
+			exit(1);
+		}
+		nvwal_pages = NULL;
+		nvwal_mapped_len = 0;
+	}
+#endif
+
 	if (verbose)
 		pg_log_info("base backup completed");
 }
@@ -2257,6 +2461,7 @@ main(int argc, char **argv)
 		{"no-manifest", no_argument, NULL, 5},
 		{"manifest-force-encode", no_argument, NULL, 6},
 		{"manifest-checksums", required_argument, NULL, 7},
+		{"nvwal-path", required_argument, NULL, 8},
 		{NULL, 0, NULL, 0}
 	};
 	int			c;
@@ -2297,9 +2502,27 @@ main(int argc, char **argv)
 				break;
 			case 'F':
 				if (strcmp(optarg, "p") == 0 || strcmp(optarg, "plain") == 0)
+				{
+					/* See the comment for "nvwal" below */
 					format = 'p';
+					format_nvwal = false;
+				}
 				else if (strcmp(optarg, "t") == 0 || strcmp(optarg, "tar") == 0)
+				{
+					/* See the comment for "nvwal" below */
 					format = 't';
+					format_nvwal = false;
+				}
+				else if (strcmp(optarg, "n") == 0 || strcmp(optarg, "nvwal") == 0)
+				{
+					/*
+					 * If "nvwal" mode is given, we set the two variables as
+					 * follows because it is almost the same as "plain" mode,
+					 * except for NVWAL handling.
+					 */
+					format = 'p';
+					format_nvwal = true;
+				}
 				else
 				{
 					pg_log_error("invalid output format \"%s\", must be \"plain\" or \"tar\"",
@@ -2354,6 +2577,9 @@ main(int argc, char **argv)
 			case 1:
 				xlog_dir = pg_strdup(optarg);
 				break;
+			case 8:
+				nvwal_path = pg_strdup(optarg);
+				break;
 			case 'l':
 				label = pg_strdup(optarg);
 				break;
@@ -2535,7 +2761,7 @@ main(int argc, char **argv)
 	{
 		if (format != 'p')
 		{
-			pg_log_error("WAL directory location can only be specified in plain mode");
+			pg_log_error("WAL directory location can only be specified in plain or nvwal mode");
 			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
 					progname);
 			exit(1);
@@ -2552,6 +2778,44 @@ main(int argc, char **argv)
 		}
 	}
 
+#ifdef USE_NVWAL
+	if (format_nvwal)
+	{
+		if (nvwal_path == NULL)
+		{
+			pg_log_error("NVWAL file location must be given in nvwal mode");
+			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+					progname);
+			exit(1);
+		}
+
+		/* clean up NVWAL file name and check if it is absolute */
+		canonicalize_path(nvwal_path);
+		if (!is_absolute_path(nvwal_path))
+		{
+			pg_log_error("NVWAL file location must be an absolute path");
+			fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+					progname);
+			exit(1);
+		}
+
+		/* We do not map NVWAL file here because we do not know its size yet */
+	}
+	else if (nvwal_path != NULL)
+	{
+		pg_log_error("NVWAL file location can only be specified in nvwal mode");
+		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
+				progname);
+		exit(1);
+	}
+#else
+	if (format_nvwal || nvwal_path != NULL)
+	{
+		pg_log_error("this build does not support nvwal mode");
+		exit(1);
+	}
+#endif /* USE_NVWAL */
+
 #ifndef HAVE_LIBZ
 	if (compresslevel != 0)
 	{
@@ -2596,6 +2860,9 @@ main(int argc, char **argv)
 	}
 	atexit(disconnect_atexit);
 
+	/* Remember the predicate for use after disconnection */
+	xlogdir_is_pg_xlog = (PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL);
+
 	/*
 	 * Set umask so that directories/files are created with the same
 	 * permissions as directories/files in the source data directory.
@@ -2622,6 +2889,16 @@ main(int argc, char **argv)
 	if (!RetrieveWalSegSize(conn))
 		exit(1);
 
+#ifdef USE_NVWAL
+	/* determine remote server's NVWAL size */
+	if (format_nvwal)
+	{
+		nvwal_size = RetrieveNvwalSize(conn);
+		if (nvwal_size == 0)
+			exit(1);
+	}
+#endif
+
 	/* Create pg_wal symlink, if required */
 	if (xlog_dir)
 	{
@@ -2634,8 +2911,7 @@ main(int argc, char **argv)
 		 * renamed to pg_wal in post-10 clusters.
 		 */
 		linkloc = psprintf("%s/%s", basedir,
-						   PQserverVersion(conn) < MINIMUM_VERSION_FOR_PG_WAL ?
-						   "pg_xlog" : "pg_wal");
+						   xlogdir_is_pg_xlog ? "pg_xlog" : "pg_wal");
 
 #ifdef HAVE_SYMLINK
 		if (symlink(xlog_dir, linkloc) != 0)
@@ -2650,6 +2926,41 @@ main(int argc, char **argv)
 		free(linkloc);
 	}
 
+#ifdef USE_NVWAL
+	/* Create and map NVWAL file if required */
+	if (format_nvwal)
+	{
+		int		is_pmem = 0;
+
+		nvwal_pages = pmem_map_file(nvwal_path, nvwal_size,
+									PMEM_FILE_CREATE|PMEM_FILE_EXCL,
+									pg_file_create_mode,
+									&nvwal_mapped_len, &is_pmem);
+		if (nvwal_pages == NULL)
+		{
+			pg_log_error("could not map a new NVWAL file \"%s\": %m",
+						 nvwal_path);
+			exit(1);
+		}
+
+		made_new_nvwal = true;
+		atexit(cleanup_nvwal_atexit);
+
+		if (!is_pmem)
+		{
+			pg_log_error("NVWAL file \"%s\" is not on PMEM", nvwal_path);
+			exit(1);
+		}
+
+		if (nvwal_size != nvwal_mapped_len)
+		{
+			pg_log_error("invalid size of NVWAL file \"%s\"; expected %zu, actual %zu",
+						 nvwal_path, nvwal_size, nvwal_mapped_len);
+			exit(1);
+		}
+	}
+#endif
+
 	BaseBackup();
 
 	success = true;
diff --git a/src/bin/pg_basebackup/streamutil.c b/src/bin/pg_basebackup/streamutil.c
index 99daf0e972..d21605ddbd 100644
--- a/src/bin/pg_basebackup/streamutil.c
+++ b/src/bin/pg_basebackup/streamutil.c
@@ -397,6 +397,75 @@ RetrieveDataDirCreatePerm(PGconn *conn)
 	return true;
 }
 
+#ifdef USE_NVWAL
+/*
+ * Returns nvwal_size in bytes if available, 0 otherwise.
+ * Note that it is the caller's responsibility to check that the returned
+ * nvwal_size is really valid, that is, a multiple of the WAL segment size.
+ */
+size_t
+RetrieveNvwalSize(PGconn *conn)
+{
+	PGresult   *res;
+	char		unit[3];
+	int			val;
+	size_t		nvwal_size;
+
+	/* check connection existence */
+	Assert(conn != NULL);
+
+	/* fail if we do not have SHOW command */
+	if (PQserverVersion(conn) < MINIMUM_VERSION_FOR_SHOW_CMD)
+	{
+		pg_log_error("SHOW command is not supported for retrieving nvwal_size");
+		return 0;
+	}
+
+	res = PQexec(conn, "SHOW nvwal_size");
+	if (PQresultStatus(res) != PGRES_TUPLES_OK)
+	{
+		pg_log_error("could not send replication command \"%s\": %s",
+					 "SHOW nvwal_size", PQerrorMessage(conn));
+
+		PQclear(res);
+		return 0;
+	}
+	if (PQntuples(res) != 1 || PQnfields(res) < 1)
+	{
+		pg_log_error("could not fetch NVWAL size: got %d rows and %d fields, expected %d rows and %d or more fields",
+					 PQntuples(res), PQnfields(res), 1, 1);
+
+		PQclear(res);
+		return 0;
+	}
+
+	/* fetch value and unit from the result */
+	if (sscanf(PQgetvalue(res, 0, 0), "%d%2s", &val, unit) != 2)
+	{
+		pg_log_error("NVWAL size could not be parsed");
+		PQclear(res);
+		return 0;
+	}
+
+	PQclear(res);
+
+	/* convert to bytes */
+	if (strcmp(unit, "MB") == 0)
+		nvwal_size = ((size_t) val) << 20;
+	else if (strcmp(unit, "GB") == 0)
+		nvwal_size = ((size_t) val) << 30;
+	else if (strcmp(unit, "TB") == 0)
+		nvwal_size = ((size_t) val) << 40;
+	else
+	{
+		pg_log_error("unsupported NVWAL unit");
+		return 0;
+	}
+
+	return nvwal_size;
+}
+#endif
+
 /*
  * Run IDENTIFY_SYSTEM through a given connection and give back to caller
  * some result information if requested:
diff --git a/src/bin/pg_basebackup/streamutil.h b/src/bin/pg_basebackup/streamutil.h
index 10f87ad0c1..516240c05d 100644
--- a/src/bin/pg_basebackup/streamutil.h
+++ b/src/bin/pg_basebackup/streamutil.h
@@ -41,6 +41,9 @@ extern bool RunIdentifySystem(PGconn *conn, char **sysid,
 							  XLogRecPtr *startpos,
 							  char **db_name);
 extern bool RetrieveWalSegSize(PGconn *conn);
+#ifdef USE_NVWAL
+extern size_t RetrieveNvwalSize(PGconn *conn);
+#endif
 extern TimestampTz feGetCurrentTimestamp(void);
 extern void feTimestampDifference(TimestampTz start_time, TimestampTz stop_time,
 								  long *secs, int *microsecs);
diff --git a/src/bin/pg_rewind/pg_rewind.c b/src/bin/pg_rewind/pg_rewind.c
index 359a6a587c..138b6dbb43 100644
--- a/src/bin/pg_rewind/pg_rewind.c
+++ b/src/bin/pg_rewind/pg_rewind.c
@@ -394,7 +394,7 @@ main(int argc, char **argv)
 		pg_log_info("no rewind required");
 		if (writerecoveryconf && !dry_run)
 			WriteRecoveryConfig(conn, datadir_target,
-								GenerateRecoveryConfig(conn, NULL));
+								GenerateRecoveryConfig(conn, NULL, NULL));
 		exit(0);
 	}
 
@@ -469,7 +469,7 @@ main(int argc, char **argv)
 	/* Also update the standby configuration, if requested. */
 	if (writerecoveryconf && !dry_run)
 		WriteRecoveryConfig(conn, datadir_target,
-							GenerateRecoveryConfig(conn, NULL));
+							GenerateRecoveryConfig(conn, NULL, NULL));
 
 	/* don't need the source connection anymore */
 	source->destroy(source);
diff --git a/src/fe_utils/recovery_gen.c b/src/fe_utils/recovery_gen.c
index 2643ecd6f3..2da08cbd8e 100644
--- a/src/fe_utils/recovery_gen.c
+++ b/src/fe_utils/recovery_gen.c
@@ -20,7 +20,7 @@ static char *escape_quotes(const char *src);
  * return it.
  */
 PQExpBuffer
-GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
+GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot, char *nvwal_path)
 {
 	PQconninfoOption *connOptions;
 	PQExpBufferData conninfo_buf;
@@ -95,6 +95,13 @@ GenerateRecoveryConfig(PGconn *pgconn, char *replication_slot)
 						  replication_slot);
 	}
 
+	if (nvwal_path)
+	{
+		escaped = escape_quotes(nvwal_path);
+		appendPQExpBuffer(contents, "nvwal_path = '%s'\n", escaped);
+		free(escaped);
+	}
+
 	if (PQExpBufferBroken(contents))
 	{
 		pg_log_error("out of memory");
diff --git a/src/include/fe_utils/recovery_gen.h b/src/include/fe_utils/recovery_gen.h
index 7ac8953943..169c3f1337 100644
--- a/src/include/fe_utils/recovery_gen.h
+++ b/src/include/fe_utils/recovery_gen.h
@@ -21,7 +21,8 @@
 #define MINIMUM_VERSION_FOR_RECOVERY_GUC 120000
 
 extern PQExpBuffer GenerateRecoveryConfig(PGconn *pgconn,
-										  char *pg_replication_slot);
+										  char *pg_replication_slot,
+										  char *nvwal_path);
 extern void WriteRecoveryConfig(PGconn *pgconn, char *target_dir,
 								PQExpBuffer contents);
 
-- 
2.25.1
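
A note on RetrieveNvwalSize() in the patch above: it splits the text returned by "SHOW nvwal_size" into a number and a two-letter unit. Below is a minimal standalone sketch of that parsing step, assuming the value is always printed as an integer directly followed by MB, GB, or TB. The helper name parse_nvwal_size_text is hypothetical and not part of the patch; the width limit in "%2s" and the "%n" trailing-garbage check are my additions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not part of the patch): convert text such as "16GB"
 * into bytes.  Returns 0 if the text cannot be parsed or the unit is not
 * one of MB, GB, or TB.  Assumes a 64-bit size_t, as on x64.
 */
static size_t
parse_nvwal_size_text(const char *text)
{
	char		unit[3];		/* two unit characters plus terminating NUL */
	int			val;
	int			consumed = 0;

	assert(text != NULL);

	/* "%2s" bounds the write into unit[]; "%n" records how much was read */
	if (sscanf(text, "%d%2s%n", &val, unit, &consumed) != 2)
		return 0;

	/* reject non-positive values and anything left over after the unit */
	if (val <= 0 || text[consumed] != '\0')
		return 0;

	if (strcmp(unit, "MB") == 0)
		return ((size_t) val) << 20;
	if (strcmp(unit, "GB") == 0)
		return ((size_t) val) << 30;
	if (strcmp(unit, "TB") == 0)
		return ((size_t) val) << 40;

	return 0;					/* unsupported unit, e.g. "kB" */
}
```

As in the patch, a return value of 0 means failure, so the caller still has to check that the result is a multiple of the WAL segment size.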

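For readers following copy_walseg_onto_nvwal() in the patch above: each segment is placed at (segno * WalSegSz) % nvwal_size, so the NVWAL file behaves as a ring of segment-sized slots. The sketch below only illustrates that arithmetic; the function name and the plain integer types are assumptions made for the example, and it assumes nvwal_size is a non-zero multiple of the segment size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration of the placement rule: map a segment number
 * to its byte offset inside an NVWAL region used as a ring of segments.
 */
static size_t
nvwal_offset_of_segment(uint64_t segno, size_t seg_size, size_t nvwal_size)
{
	/* the sketch assumes the ring holds a whole number of segments */
	assert(seg_size > 0 && nvwal_size > 0);
	assert(nvwal_size % seg_size == 0);

	return (size_t) ((segno * (uint64_t) seg_size) % nvwal_size);
}
```

With 16MB segments and a 64MB NVWAL, segment 1 lands at offset 16MB, segment 4 wraps around to offset 0, and segment 5 reuses the slot of segment 1.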
v4-0002-Non-volatile-WAL-buffer.patch (application/octet-stream)
From 5baddf98e2dcc4b3dc5550f93570a50f2b4f43fd Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 24 Jun 2020 15:07:57 +0900
Subject: [PATCH v4 2/6] Non-volatile WAL buffer

Now the external WAL buffer becomes non-volatile.

Bumps PG_CONTROL_VERSION.
---
 src/backend/access/transam/xlog.c            | 1158 ++++++++++++++++--
 src/backend/access/transam/xlogreader.c      |   24 +
 src/bin/pg_controldata/pg_controldata.c      |    3 +
 src/include/access/xlog.h                    |    8 +
 src/include/catalog/pg_control.h             |   17 +-
 src/test/regress/expected/misc_functions.out |   14 +-
 src/test/regress/sql/misc_functions.sql      |   14 +-
 7 files changed, 1099 insertions(+), 139 deletions(-)
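
A note on the NVWAL branch of XLogFlush() in this patch: it flushes the byte range between the last known persistent LSN and the requested LSN, splitting the range in two when it wraps around the end of the buffer and flushing the whole buffer when the range is larger than NvwalSize. The standalone sketch below mirrors only that range computation. FlushRange and nvwal_flush_ranges are hypothetical names, plain integers stand in for XLogRecPtr, and unlike the patch the sketch skips a zero-length second range.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one contiguous byte range inside the NVWAL buffer */
typedef struct FlushRange
{
	size_t		off;
	size_t		len;
} FlushRange;

/*
 * Compute the buffer ranges to flush for LSNs in [from_lsn, upto_lsn).
 * Returns the number of ranges written to out[] (0, 1, or 2).
 */
static int
nvwal_flush_ranges(uint64_t from_lsn, uint64_t upto_lsn,
				   size_t nvwal_size, FlushRange out[2])
{
	size_t		fromoff;
	size_t		uptooff;
	int			n = 0;

	if (upto_lsn <= from_lsn)
		return 0;				/* already persistent */

	/* the request covers more than the whole ring: flush everything */
	if (upto_lsn - from_lsn > nvwal_size)
	{
		out[0] = (FlushRange) {0, nvwal_size};
		return 1;
	}

	fromoff = (size_t) (from_lsn % nvwal_size);
	uptooff = (size_t) (upto_lsn % nvwal_size);

	/* handle rotation: first flush the tail of the buffer */
	if (uptooff <= fromoff)
	{
		out[n++] = (FlushRange) {fromoff, nvwal_size - fromoff};
		fromoff = 0;
	}

	if (uptooff > fromoff)
		out[n++] = (FlushRange) {fromoff, uptooff - fromoff};

	return n;
}
```

In the patch each range is passed to nv_flush(), and a single nv_drain() afterwards acts as the store barrier for all of them.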

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8a125193aa..a74ed6f6c6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -654,6 +654,13 @@ typedef struct XLogCtlData
 	TimeLineID	ThisTimeLineID;
 	TimeLineID	PrevTimeLineID;
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * All the records up to this LSN are persistent in NVWAL.
+	 */
+	XLogRecPtr	persistentUpTo;
+
 	/*
 	 * SharedRecoveryState indicates if we're still in crash or archive
 	 * recovery.  Protected by info_lck.
@@ -795,11 +802,13 @@ typedef enum
 	XLOG_FROM_ANY = 0,			/* request to read WAL from any source */
 	XLOG_FROM_ARCHIVE,			/* restored using restore_command */
 	XLOG_FROM_PG_WAL,			/* existing file in pg_wal */
-	XLOG_FROM_STREAM			/* streamed from primary */
+	XLOG_FROM_NVWAL,			/* non-volatile WAL buffer */
+	XLOG_FROM_STREAM,			/* streamed from primary via segment file */
+	XLOG_FROM_STREAM_NVWAL		/* same as above, but via NVWAL */
 } XLogSource;
 
 /* human-readable names for XLogSources, for debugging output */
-static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
+static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "nvwal", "stream", "stream_nvwal"};
 
 /*
  * openLogFile is -1 or a kernel FD for an open log file segment.
@@ -934,6 +943,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
+static void PreallocNonVolatileXlogBuffer(void);
 static void PreallocXlogFiles(XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
@@ -1218,6 +1228,43 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/*
+	 * Request a checkpoint here if non-volatile WAL buffer is used and we
+	 * have consumed too much WAL since the last checkpoint.
+	 *
+	 * We first screen under the condition (1) OR (2) below:
+	 *
+	 * (1) The record was the first one in a certain segment.
+	 * (2) The record was inserted across segments.
+	 *
+	 * We then check the segment number which the record was inserted into.
+	 */
+	if (NvwalAvail && inserted &&
+		(StartPos % wal_segment_size == SizeOfXLogLongPHD ||
+		 StartPos / wal_segment_size < EndPos / wal_segment_size))
+	{
+		XLogSegNo	end_segno;
+
+		XLByteToSeg(EndPos, end_segno, wal_segment_size);
+
+		/*
+		 * NOTE: We do not signal walsender here because the inserted record
+		 * has not been drained by the NVWAL buffer yet.
+		 *
+		 * NOTE: We do not signal walarchiver here because the inserted record
+		 * has not been flushed to a segment file.  So we don't need to update
+		 * XLogCtl->lastSegSwitch{Time,LSN}, used only by CheckArchiveTimeout.
+		 */
+
+		/* Two-step checking for speed (see also XLogWrite) */
+		if (IsUnderPostmaster && XLogCheckpointNeeded(end_segno))
+		{
+			(void) GetRedoRecPtr();
+			if (XLogCheckpointNeeded(end_segno))
+				RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
+		}
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -2151,6 +2198,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	XLogRecPtr	NewPageBeginPtr;
 	XLogPageHeader NewPage;
 	int			npages = 0;
+	bool		is_firstpage;
+
+	if (NvwalAvail)
+		elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo,
+			 (uint32) (upto >> 32),
+			 (uint32) upto,
+			 opportunistic ? "true" : "false");
 
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
@@ -2212,7 +2268,25 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 				{
 					/* Have to write it ourselves */
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
+
+					if (NvwalAvail)
+					{
+						/*
+						 * If we use the non-volatile WAL buffer, writing the
+						 * buffer pages out to segment files is a special but
+						 * expected case and, for simplicity, it is done
+						 * segment by segment.
+						 */
+						XLogRecPtr		OldSegEndPtr;
+
+						OldSegEndPtr = OldPageRqstPtr - XLOG_BLCKSZ + wal_segment_size;
+						Assert(OldSegEndPtr % wal_segment_size == 0);
+
+						WriteRqst.Write = OldSegEndPtr;
+					}
+					else
+						WriteRqst.Write = OldPageRqstPtr;
+
 					WriteRqst.Flush = 0;
 					XLogWrite(WriteRqst, false);
 					LWLockRelease(WALWriteLock);
@@ -2240,7 +2314,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
 		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
+		if (NvwalAvail)
+		{
+			/*
+			 * We do not combine MemSet() and pmem_persist() because
+			 * pmem_persist() may use a slow, strongly ordered cache flush
+			 * instruction if a fast, weakly ordered one is not supported.
+			 * Instead, we first fill the buffer with zeroes using
+			 * pmem_memset_persist(), which can leverage fast non-temporal
+			 * store instructions, then make the header persistent later.
+			 */
+			nv_memset_persist(NewPage, 0, XLOG_BLCKSZ);
+		}
+		else
+			MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
 
 		/*
 		 * Fill the new page's header
@@ -2272,7 +2359,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		/*
 		 * If first page of an XLOG segment file, make it a long header.
 		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+		is_firstpage = ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0);
+		if (is_firstpage)
 		{
 			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
 
@@ -2287,7 +2375,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
 		 * holding a lock.
 		 */
-		pg_write_barrier();
+		if (NvwalAvail)
+		{
+			/* Make the header persistent on PMEM */
+			nv_persist(NewPage, is_firstpage ? SizeOfXLogLongPHD : SizeOfXLogShortPHD);
+		}
+		else
+			pg_write_barrier();
 
 		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
 
@@ -2297,6 +2391,13 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	}
 	LWLockRelease(WALBufMappingLock);
 
+	if (NvwalAvail)
+		elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
+			 (uint32) (ControlFile->discardedUpTo >> 32),
+			 (uint32) ControlFile->discardedUpTo,
+			 (uint32) (XLogCtl->InitializedUpTo >> 32),
+			 (uint32) XLogCtl->InitializedUpTo);
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
 	{
@@ -2678,6 +2779,23 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		LogwrtResult.Flush = LogwrtResult.Write;
 	}
 
+	/*
+	 * Update discardedUpTo if NVWAL is used.  A new value should not fall
+	 * behind the old one.
+	 */
+	if (NvwalAvail)
+	{
+		Assert(LogwrtResult.Write == LogwrtResult.Flush);
+
+		LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+		if (ControlFile->discardedUpTo < LogwrtResult.Write)
+		{
+			ControlFile->discardedUpTo = LogwrtResult.Write;
+			UpdateControlFile();
+		}
+		LWLockRelease(ControlFileLock);
+	}
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -2882,6 +3000,123 @@ XLogFlush(XLogRecPtr record)
 		return;
 	}
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	FromPos;
+
+		/*
+		 * No page on the NVWAL is to be flushed to segment files.  Instead,
+		 * we wait for all the insertions preceding this one to complete.  We
+		 * will wait for all the records to be persistent on the NVWAL below.
+		 */
+		record = WaitXLogInsertionsToFinish(record);
+
+		/*
+		 * Check if another backend has already done what I am doing.
+		 *
+		 * We can compare something <= XLogCtl->persistentUpTo without
+		 * holding XLogCtl->info_lck spinlock because persistentUpTo is
+		 * monotonically increasing and can be loaded atomically on each
+		 * NVWAL-supported platform (now x64 only).
+		 */
+		FromPos = *((volatile XLogRecPtr *) &XLogCtl->persistentUpTo);
+		if (record <= FromPos)
+			return;
+
+		/*
+		 * In a very rare case, we have wrapped around the whole NVWAL.  We
+		 * do not need to care about old pages here because they have already
+		 * been evicted to segment files at record insertion.
+		 *
+		 * In such a case, we flush the whole NVWAL.  We also log it as a
+		 * warning because it can be a time-consuming operation.
+		 *
+		 * TODO Advance XLogCtl->persistentUpTo at the end of XLogWrite, and
+		 * we can remove the following first if-block.
+		 */
+		if (record - FromPos > NvwalSize)
+		{
+			elog(WARNING, "flushing the whole NVWAL; FromPos %X/%X; record %X/%X",
+				 (uint32) (FromPos >> 32), (uint32) FromPos,
+				 (uint32) (record >> 32), (uint32) record);
+
+			nv_flush(XLogCtl->pages, NvwalSize);
+		}
+		else
+		{
+			char   *frompos;
+			char   *uptopos;
+			size_t	fromoff;
+			size_t	uptooff;
+
+			/*
+			 * Flush each record that is probably not flushed yet.
+			 *
+			 * We say "probably" for two reasons.  The first is that a record
+			 * copied with non-temporal store instructions has already been
+			 * "flushed", but we cannot distinguish it.  nv_flush on such a
+			 * record does not harm consistency.
+			 *
+			 * The second reason is that the target record might have already
+			 * been evicted to a segment file by now.  In this case as well,
+			 * nv_flush does not harm consistency.
+			 */
+			uptooff = record % NvwalSize;
+			uptopos = XLogCtl->pages + uptooff;
+			fromoff = FromPos % NvwalSize;
+			frompos = XLogCtl->pages + fromoff;
+
+			/* Handles rotation */
+			if (uptopos <= frompos)
+			{
+				nv_flush(frompos, NvwalSize - fromoff);
+				fromoff = 0;
+				frompos = XLogCtl->pages;
+			}
+
+			nv_flush(frompos, uptooff - fromoff);
+		}
+
+		/*
+		 * To guarantee durability ("D" of ACID), we should satisfy the
+		 * following two for each transaction X:
+		 *
+		 *  (1) All the WAL records inserted by X, including the commit record
+		 *      of X, should persist on NVWAL before the server commits X.
+		 *
+		 *  (2) All the WAL records inserted by any transaction other than
+		 *      X that have a lower LSN than the commit record just inserted
+		 *      by X should persist on NVWAL before the server commits X.
+		 *
+		 * (1) can be satisfied by a store barrier after the commit record
+		 * of X is flushed because each WAL record of X is already flushed at
+		 * the end of its insertion.  (2) can be satisfied by waiting for any
+		 * record insertions that have a lower LSN than the commit record just
+		 * inserted by X, and by a store barrier as well.
+		 *
+		 * Now is the time.  Have a store barrier.
+		 */
+		nv_drain();
+
+		/*
+		 * Remember where the last persistent record is.  A new value should
+		 * not fall behind the old one.
+		 */
+		SpinLockAcquire(&XLogCtl->info_lck);
+		if (XLogCtl->persistentUpTo < record)
+			XLogCtl->persistentUpTo = record;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/*
+		 * The records up to the returned "record" are now persistent on
+		 * NVWAL.  Now signal walsenders.
+		 */
+		WalSndWakeupRequest();
+		WalSndWakeupProcessRequests();
+
+		return;
+	}
+
 	/* Quick exit if already known flushed */
 	if (record <= LogwrtResult.Flush)
 		return;
@@ -3065,6 +3300,13 @@ XLogBackgroundFlush(void)
 	if (RecoveryInProgress())
 		return false;
 
+	/*
+	 * Quick exit if NVWAL buffer is used and archiving is not active. In this
+	 * case, we need no WAL segment file in pg_wal directory.
+	 */
+	if (NvwalAvail && !XLogArchivingActive())
+		return false;
+
 	/* read LogwrtResult and update local state */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
@@ -3083,6 +3325,18 @@ XLogBackgroundFlush(void)
 		flexible = false;		/* ensure it all gets written */
 	}
 
+	/*
+	 * If NVWAL is used, back off to the last completed segment boundary
+	 * so that buffer pages are written to files segment by segment.  We do
+	 * so here, after XLogCtl->asyncXactLSN is loaded, because that value
+	 * should be taken into account.
+	 */
+	if (NvwalAvail)
+	{
+		WriteRqst.Write -= WriteRqst.Write % wal_segment_size;
+		flexible = false;		/* ensure it all gets written */
+	}
+
 	/*
 	 * If already known flushed, we're done. Just need to check if we are
 	 * holding an open file handle to a logfile that's no longer in use,
@@ -3109,7 +3363,12 @@ XLogBackgroundFlush(void)
 	flushbytes =
 		WriteRqst.Write / XLOG_BLCKSZ - LogwrtResult.Flush / XLOG_BLCKSZ;
 
-	if (WalWriterFlushAfter == 0 || lastflush == 0)
+	if (NvwalAvail)
+	{
+		WriteRqst.Flush = WriteRqst.Write;
+		lastflush = now;
+	}
+	else if (WalWriterFlushAfter == 0 || lastflush == 0)
 	{
 		/* first call, or block based limits disabled */
 		WriteRqst.Flush = WriteRqst.Write;
@@ -3168,7 +3427,28 @@ XLogBackgroundFlush(void)
 	 * Great, done. To take some work off the critical path, try to initialize
 	 * as many of the no-longer-needed WAL buffers for future use as we can.
 	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
+	if (NvwalAvail && max_wal_senders == 0)
+	{
+		XLogRecPtr		upto;
+
+		/*
+		 * If NVWAL is used and there is no walsender, nobody is going to
+		 * load segments from the buffer.  So let's recycle segments up to
+		 * {where we have requested to write and flush} + NvwalSize.
+		 *
+		 * Note that if NVWAL is used and a walsender may be running, we do
+		 * nothing; we keep the written pages on the buffer so that walsenders
+		 * load them from the buffer, not from the segment files.  The buffer
+		 * pages are eventually recycled by checkpoint.
+		 */
+		Assert(WriteRqst.Write == WriteRqst.Flush);
+		Assert(WriteRqst.Write % wal_segment_size == 0);
+
+		upto = WriteRqst.Write + NvwalSize;
+		AdvanceXLInsertBuffer(upto - XLOG_BLCKSZ, false);
+	}
+	else
+		AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
 
 	/*
 	 * If we determined that we need to write data, but somebody else
@@ -3916,6 +4196,43 @@ XLogFileClose(void)
 	ReleaseExternalFD();
 }
 
+/*
+ * Preallocate non-volatile XLOG buffers.
+ *
+ * This zeroes buffers and prepares page headers up to
+ * ControlFile->discardedUpTo + S, where S is the total size of
+ * the non-volatile XLOG buffers.
+ *
+ * It is the caller's responsibility to update ControlFile->discardedUpTo
+ * and to set XLogCtl->InitializedUpTo properly.
+ */
+static void
+PreallocNonVolatileXlogBuffer(void)
+{
+	XLogRecPtr	newupto,
+				InitializedUpTo;
+
+	Assert(NvwalAvail);
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	newupto = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	InitializedUpTo = XLogCtl->InitializedUpTo;
+
+	newupto += NvwalSize;
+	Assert(newupto % wal_segment_size == 0);
+
+	if (newupto <= InitializedUpTo)
+		return;
+
+	/*
+	 * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
+	 * handles the first argument as the beginning of pages, not the end.
+	 */
+	AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+}
+
 /*
  * Preallocate log files beyond the specified log endpoint.
  *
@@ -4212,8 +4529,11 @@ RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
 	 * symbolic links pointing to a separate archive directory.
+	 *
+	 * If the NVWAL buffer is used, a log segment file is never recycled
+	 * (that is, we always go into the else block).
 	 */
-	if (wal_recycle &&
+	if (!NvwalAvail && wal_recycle &&
 		*endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(endlogSegNo, path,
@@ -4631,6 +4951,7 @@ InitControlFile(uint64 sysidentifier)
 	memcpy(ControlFile->mock_authentication_nonce, mock_auth_nonce, MOCK_AUTH_NONCE_LEN);
 	ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->unloggedLSN = FirstNormalUnloggedLSN;
+	ControlFile->discardedUpTo = (NvwalAvail) ? wal_segment_size : InvalidXLogRecPtr;
 
 	/* Set important parameter values for use when replaying WAL */
 	ControlFile->MaxConnections = MaxConnections;
@@ -5461,41 +5782,58 @@ BootStrapXLOG(void)
 	record->xl_crc = crc;
 
 	/* Create first XLOG segment file */
-	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	if (NvwalAvail)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		nv_memcpy_nodrain(XLogCtl->pages + wal_segment_size, page, XLOG_BLCKSZ);
+		pgstat_report_wait_end();
 
-	/*
-	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
-	 * close the file again in a moment.
-	 */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		nv_drain();
+		pgstat_report_wait_end();
 
-	/* Write the first page with the initial record */
-	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
-	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		/*
+		 * Other WAL state will be initialized in the startup process.
+		 */
 	}
-	pgstat_report_wait_end();
+	else
+	{
+		use_existent = false;
+		openLogFile = XLogFileInit(1, &use_existent, false);
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
-	pgstat_report_wait_end();
+		/*
+		 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
+		 * close the file again in a moment.
+		 */
 
-	if (close(openLogFile) != 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not close bootstrap write-ahead log file: %m")));
+		/* Write the first page with the initial record */
+		errno = 0;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
+		pgstat_report_wait_end();
 
-	openLogFile = -1;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
+		if (pg_fsync(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_end();
+
+		if (close(openLogFile) != 0)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not close bootstrap write-ahead log file: %m")));
+
+		openLogFile = -1;
+	}
 
 	/* Now create pg_control */
 	InitControlFile(sysidentifier);
@@ -5749,41 +6087,47 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * happens in the middle of a segment, copy data from the last WAL segment
 	 * of the old timeline up to the switch point, to the starting WAL segment
 	 * on the new timeline.
+	 *
+	 * If the non-volatile WAL buffer is used, no new segment file is created.
+	 * Data up to the switch point will be copied into NVWAL by StartupXLOG().
 	 */
-	if (endLogSegNo == startLogSegNo)
+	if (!NvwalAvail)
 	{
-		/*
-		 * Make a copy of the file on the new timeline.
-		 *
-		 * Writing WAL isn't allowed yet, so there are no locking
-		 * considerations. But we should be just as tense as XLogFileInit to
-		 * avoid emplacing a bogus file.
-		 */
-		XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
-					 XLogSegmentOffset(endOfLog, wal_segment_size));
-	}
-	else
-	{
-		/*
-		 * The switch happened at a segment boundary, so just create the next
-		 * segment on the new timeline.
-		 */
-		bool		use_existent = true;
-		int			fd;
+		if (endLogSegNo == startLogSegNo)
+		{
+			/*
+			 * Make a copy of the file on the new timeline.
+			 *
+			 * Writing WAL isn't allowed yet, so there are no locking
+			 * considerations. But we should be just as tense as XLogFileInit to
+			 * avoid emplacing a bogus file.
+			 */
+			XLogFileCopy(endLogSegNo, endTLI, endLogSegNo,
+						 XLogSegmentOffset(endOfLog, wal_segment_size));
+		}
+		else
+		{
+			/*
+			 * The switch happened at a segment boundary, so just create the next
+			 * segment on the new timeline.
+			 */
+			bool		use_existent = true;
+			int			fd;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+			fd = XLogFileInit(startLogSegNo, &use_existent, true);
 
-		if (close(fd) != 0)
-		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno = errno;
+			if (close(fd) != 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno = errno;
 
-			XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
-						 wal_segment_size);
-			errno = save_errno;
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not close file \"%s\": %m", xlogfname)));
+				XLogFileName(xlogfname, ThisTimeLineID, startLogSegNo,
+							 wal_segment_size);
+				errno = save_errno;
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not close file \"%s\": %m", xlogfname)));
+			}
 		}
 	}
 
@@ -7084,6 +7428,11 @@ StartupXLOG(void)
 		InRecovery = true;
 	}
 
+	/* Dump discardedUpTo just before REDO */
+	elog(LOG, "ControlFile->discardedUpTo %X/%X",
+		 (uint32) (ControlFile->discardedUpTo >> 32),
+		 (uint32) ControlFile->discardedUpTo);
+
 	/* REDO */
 	if (InRecovery)
 	{
@@ -7874,10 +8223,88 @@ StartupXLOG(void)
 	Insert->PrevBytePos = XLogRecPtrToBytePos(LastRec);
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
+	if (NvwalAvail)
+	{
+		XLogRecPtr	discardedUpTo;
+
+		discardedUpTo = ControlFile->discardedUpTo;
+		Assert(discardedUpTo == InvalidXLogRecPtr ||
+			   discardedUpTo % wal_segment_size == 0);
+
+		if (discardedUpTo == InvalidXLogRecPtr)
+		{
+			elog(DEBUG1, "brand-new NVWAL");
+
+			/* The "Tricky point" block below initializes the buffer */
+		}
+		else if (EndOfLog <= discardedUpTo)
+		{
+			elog(DEBUG1, "no record on NVWAL has been UNDONE");
+
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = InvalidXLogRecPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+
+			nv_memset_persist(XLogCtl->pages, 0, NvwalSize);
+
+			/* The "Tricky point" block below initializes the buffer */
+		}
+		else
+		{
+			int			last_idx;
+			int			idx;
+			XLogRecPtr	ptr;
+
+			elog(DEBUG1, "some records on NVWAL have been UNDONE; keep them");
+
+			/*
+			 * Initialize the xlblocks array because we decided to keep UNDONE
+			 * records on the NVWAL buffer; otherwise each buffer page whose
+			 * xlblocks entry is 0 (as initialized by XLOGShmemInit) would be
+			 * accidentally cleared by the following AdvanceXLInsertBuffer!
+			 *
+			 * Two cases can be considered:
+			 *
+			 * 1) EndOfLog is on a page boundary (divisible by XLOG_BLCKSZ):
+			 *    Initialize up to (and including) the page containing the last
+			 *    record.  That page should end at EndOfLog.  The next page
+			 *    "N", beginning at EndOfLog, must be left untouched
+			 *    because, in the corner case where all the NVWAL buffer
+			 *    pages are already filled, page N is at the same
+			 *    location as the first page "F" beginning at discardedUpTo.
+			 *    Of course we should not overwrite page F.
+			 *
+			 *    In this case, we first get XLogRecPtrToBufIdx(EndOfLog) as
+			 *    last_idx, indicating page N.  Then, we go forward from
+			 *    page F up to (but excluding) page N, which in that corner
+			 *    case has the same index as page F.
+			 *
+			 * 2) EndOfLog is not on a page boundary:  Initialize all the pages
+			 *    but the page "L" having the last record. The page L is to be
+			 *    initialized by the following "Tricky point", including its
+			 *    content.
+			 *
+			 * In either case, XLogCtl->InitializedUpTo is to be initialized in
+			 * the following "Tricky" if-else block.
+			 */
+
+			last_idx = XLogRecPtrToBufIdx(EndOfLog);
+
+			ptr = discardedUpTo;
+			for (idx = XLogRecPtrToBufIdx(ptr); idx != last_idx;
+				 idx = NextBufIdx(idx))
+			{
+				ptr += XLOG_BLCKSZ;
+				XLogCtl->xlblocks[idx] = ptr;
+			}
+		}
+	}
+
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * Tricky point here: readBuf contains the *last* block that the
+	 * LastRec record spans, not the one it starts in.  The last block is
+	 * indeed the one we want to use.
 	 */
 	if (EndOfLog % XLOG_BLCKSZ != 0)
 	{
@@ -7897,6 +8324,9 @@ StartupXLOG(void)
 		memcpy(page, xlogreader->readBuf, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
+		if (NvwalAvail)
+			nv_persist(page, XLOG_BLCKSZ);
+
 		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
 		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
 	}
@@ -7910,12 +8340,54 @@ StartupXLOG(void)
 		XLogCtl->InitializedUpTo = EndOfLog;
 	}
 
-	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+	if (NvwalAvail)
+	{
+		XLogRecPtr	SegBeginPtr;
 
-	XLogCtl->LogwrtResult = LogwrtResult;
+		/*
+		 * If the NVWAL buffer is used, records should be written out to
+		 * segment files segment by segment.  So Logwrt{Rqst,Result} (and also
+		 * discardedUpTo) should be multiples of wal_segment_size.  Let's back
+		 * them off to the last segment boundary.
+		 */
 
-	XLogCtl->LogwrtRqst.Write = EndOfLog;
-	XLogCtl->LogwrtRqst.Flush = EndOfLog;
+		SegBeginPtr = EndOfLog - (EndOfLog % wal_segment_size);
+		LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+		XLogCtl->LogwrtResult = LogwrtResult;
+		XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+		XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+
+		/*
+		 * persistentUpTo does not need to be a multiple of wal_segment_size;
+		 * it should be the drained-up-to LSN.  walsender will use it to load
+		 * records from the NVWAL buffer.
+		 */
+		XLogCtl->persistentUpTo = EndOfLog;
+
+		/* Update discardedUpTo in pg_control if still invalid */
+		if (ControlFile->discardedUpTo == InvalidXLogRecPtr)
+		{
+			LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+			ControlFile->discardedUpTo = SegBeginPtr;
+			UpdateControlFile();
+			LWLockRelease(ControlFileLock);
+		}
+
+		elog(DEBUG1, "EndOfLog: %X/%X",
+			 (uint32) (EndOfLog >> 32), (uint32) EndOfLog);
+
+		elog(DEBUG1, "SegBeginPtr: %X/%X",
+			 (uint32) (SegBeginPtr >> 32), (uint32) SegBeginPtr);
+	}
+	else
+	{
+		LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
+
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		XLogCtl->LogwrtRqst.Write = EndOfLog;
+		XLogCtl->LogwrtRqst.Flush = EndOfLog;
+	}
 
 	/*
 	 * Update full_page_writes in shared memory and write an XLOG_FPW_CHANGE
@@ -8046,6 +8518,7 @@ StartupXLOG(void)
 				char		origpath[MAXPGPATH];
 				char		partialfname[MAXFNAMELEN];
 				char		partialpath[MAXPGPATH];
+				XLogRecPtr	discardedUpTo;
 
 				XLogFilePath(origpath, EndOfLogTLI, endLogSegNo, wal_segment_size);
 				snprintf(partialfname, MAXFNAMELEN, "%s.partial", origfname);
@@ -8057,6 +8530,53 @@ StartupXLOG(void)
 				 */
 				XLogArchiveCleanup(partialfname);
 
+				/*
+				 * If NVWAL is also used for archival recovery, write old
+				 * records out to segment files to archive them.  Note that we
+				 * need the WAL-related locks because LocalXLogInsertAllowed
+				 * has already been set back to -1.
+				 */
+				discardedUpTo = ControlFile->discardedUpTo;
+				if (NvwalAvail && discardedUpTo != InvalidXLogRecPtr &&
+					discardedUpTo < EndOfLog)
+				{
+					XLogwrtRqst WriteRqst;
+					TimeLineID	thisTLI = ThisTimeLineID;
+					XLogRecPtr	SegBeginPtr =
+						EndOfLog - (EndOfLog % wal_segment_size);
+
+					/*
+					 * XXX Assume that all the records have the same TLI.
+					 */
+					ThisTimeLineID = EndOfLogTLI;
+
+					WriteRqst.Write = EndOfLog;
+					WriteRqst.Flush = 0;
+
+					LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+					XLogWrite(WriteRqst, false);
+
+					/*
+					 * Force back-off to the last segment boundary.
+					 */
+					LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+					ControlFile->discardedUpTo = SegBeginPtr;
+					UpdateControlFile();
+					LWLockRelease(ControlFileLock);
+
+					LogwrtResult.Write = LogwrtResult.Flush = SegBeginPtr;
+
+					SpinLockAcquire(&XLogCtl->info_lck);
+					XLogCtl->LogwrtResult = LogwrtResult;
+					XLogCtl->LogwrtRqst.Write = SegBeginPtr;
+					XLogCtl->LogwrtRqst.Flush = SegBeginPtr;
+					SpinLockRelease(&XLogCtl->info_lck);
+
+					LWLockRelease(WALWriteLock);
+
+					ThisTimeLineID = thisTLI;
+				}
+
 				durable_rename(origpath, partialpath, ERROR);
 				XLogArchiveNotify(partialfname);
 			}
@@ -8066,7 +8586,10 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	if (NvwalAvail)
+		PreallocNonVolatileXlogBuffer();
+	else
+		PreallocXlogFiles(EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -8630,10 +9153,24 @@ GetInsertRecPtr(void)
 /*
  * GetFlushRecPtr -- Returns the current flush position, ie, the last WAL
  * position known to be fsync'd to disk.
+ *
+ * If NVWAL is used, this returns the last persistent WAL position instead.
  */
 XLogRecPtr
 GetFlushRecPtr(void)
 {
+	if (NvwalAvail)
+	{
+		XLogRecPtr		ret;
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		ret = XLogCtl->persistentUpTo;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		return ret;
+	}
+
 	SpinLockAcquire(&XLogCtl->info_lck);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	SpinLockRelease(&XLogCtl->info_lck);
@@ -8982,6 +9519,9 @@ CreateCheckPoint(int flags)
 	VirtualTransactionId *vxids;
 	int			nvxids;
 
+	/* for non-volatile WAL buffer */
+	XLogRecPtr	newDiscardedUpTo = 0;
+
 	/*
 	 * An end-of-recovery checkpoint is really a shutdown checkpoint, just
 	 * issued at a different time.
@@ -9296,6 +9836,22 @@ CreateCheckPoint(int flags)
 	 */
 	PriorRedoPtr = ControlFile->checkPointCopy.redo;
 
+	/*
+	 * If non-volatile WAL buffer is used, discardedUpTo should be updated and
+	 * persisted in the control file.  So the new value should be calculated
+	 * here.
+	 *
+	 * TODO Do not copy and paste code...
+	 */
+	if (NvwalAvail)
+	{
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		_logSegNo--;
+
+		newDiscardedUpTo = _logSegNo * wal_segment_size;
+	}
+
 	/*
 	 * Update the control file.
 	 */
@@ -9304,6 +9860,16 @@ CreateCheckPoint(int flags)
 		ControlFile->state = DB_SHUTDOWNED;
 	ControlFile->checkPoint = ProcLastRecPtr;
 	ControlFile->checkPointCopy = checkPoint;
+	if (NvwalAvail)
+	{
+		/*
+		 * A new value should not fall behind the old one.
+		 */
+		if (ControlFile->discardedUpTo < newDiscardedUpTo)
+			ControlFile->discardedUpTo = newDiscardedUpTo;
+		else
+			newDiscardedUpTo = ControlFile->discardedUpTo;
+	}
 	ControlFile->time = (pg_time_t) time(NULL);
 	/* crash recovery should always recover to the end of WAL */
 	ControlFile->minRecoveryPoint = InvalidXLogRecPtr;
@@ -9321,6 +9887,44 @@ CreateCheckPoint(int flags)
 	UpdateControlFile();
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * If we use non-volatile XLOG buffer, update XLogCtl->Logwrt{Rqst,Result}
+	 * so that the XLOG records older than newDiscardedUpTo are treated as
+	 * "already written and flushed."
+	 */
+	if (NvwalAvail)
+	{
+		Assert(newDiscardedUpTo > 0);
+
+		/* Update process-local variables */
+		LogwrtResult.Write = LogwrtResult.Flush = newDiscardedUpTo;
+
+		/*
+		 * Update shared-memory variables. We need both light-weight lock and
+		 * spin lock to update them.
+		 */
+		LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
+		SpinLockAcquire(&XLogCtl->info_lck);
+
+		/*
+		 * Note that there is a corner case in which the process-local
+		 * LogwrtResult falls behind the shared XLogCtl->LogwrtResult: the whole
+		 * non-volatile XLOG buffer is filled and some pages are written out
+		 * to segment files between UpdateControlFile and LWLockAcquire above.
+		 *
+		 * TODO For now, we ignore that case because it can hardly occur.
+		 */
+		XLogCtl->LogwrtResult = LogwrtResult;
+
+		if (XLogCtl->LogwrtRqst.Write < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Write = newDiscardedUpTo;
+		if (XLogCtl->LogwrtRqst.Flush < newDiscardedUpTo)
+			XLogCtl->LogwrtRqst.Flush = newDiscardedUpTo;
+
+		SpinLockRelease(&XLogCtl->info_lck);
+		LWLockRelease(WALWriteLock);
+	}
+
 	/* Update shared-memory copy of checkpoint XID/epoch */
 	SpinLockAcquire(&XLogCtl->info_lck);
 	XLogCtl->ckptFullXid = checkPoint.nextXid;
@@ -9344,22 +9948,48 @@ CreateCheckPoint(int flags)
 	if (PriorRedoPtr != InvalidXLogRecPtr)
 		UpdateCheckPointDistanceEstimate(RedoRecPtr - PriorRedoPtr);
 
-	/*
-	 * Delete old log files, those no longer needed for last checkpoint to
-	 * prevent the disk holding the xlog from growing full.
-	 */
-	XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
-	KeepLogSeg(recptr, &_logSegNo);
-	InvalidateObsoleteReplicationSlots(_logSegNo);
-	_logSegNo--;
-	RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	if (NvwalAvail)
+	{
+		/*
+		 * We already set _logSegNo to the value equivalent to discardedUpTo.
+		 * We first increment it to call InvalidateObsoleteReplicationSlots.
+		 */
+		_logSegNo++;
+		InvalidateObsoleteReplicationSlots(_logSegNo);
+
+		/*
+		 * Then we decrement _logSegNo again to remove WAL segment files
+		 * having spilled out of non-volatile WAL buffer.
+		 *
+		 * Note that you should set wal_recycle to off to remove segment files.
+		 */
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
+	else
+	{
+		/*
+		 * Delete old log files, those no longer needed for last checkpoint to
+		 * prevent the disk holding the xlog from growing full.
+		 */
+		XLByteToSeg(RedoRecPtr, _logSegNo, wal_segment_size);
+		KeepLogSeg(recptr, &_logSegNo);
+		InvalidateObsoleteReplicationSlots(_logSegNo);
+		_logSegNo--;
+		RemoveOldXlogFiles(_logSegNo, RedoRecPtr, recptr);
+	}
 
 	/*
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+	{
+		if (NvwalAvail)
+			PreallocNonVolatileXlogBuffer();
+		else
+			PreallocXlogFiles(recptr);
+	}
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -12148,6 +12778,170 @@ CancelBackup(void)
 	}
 }
 
+/*
+ * Is NVWAL used?
+ */
+bool
+IsNvwalAvail(void)
+{
+	return NvwalAvail;
+}
+
+/*
+ * Returns the size we can load from NVWAL and sets nvwalptr to load-from LSN.
+ */
+Size
+GetLoadableSizeFromNvwal(XLogRecPtr target, Size count, XLogRecPtr *nvwalptr)
+{
+	XLogRecPtr	readUpTo;
+	XLogRecPtr	discardedUpTo;
+
+	Assert(IsNvwalAvail());
+	Assert(nvwalptr != NULL);
+
+	readUpTo = target + count;
+
+	LWLockAcquire(ControlFileLock, LW_SHARED);
+	discardedUpTo = ControlFile->discardedUpTo;
+	LWLockRelease(ControlFileLock);
+
+	/* Check if all the records are on WAL segment files */
+	if (readUpTo <= discardedUpTo)
+		return 0;
+
+	/* Check if all the records are on NVWAL */
+	if (discardedUpTo <= target)
+	{
+		*nvwalptr = target;
+		return count;
+	}
+
+	/* Some on WAL segment files, some on NVWAL */
+	*nvwalptr = discardedUpTo;
+	return (Size) (readUpTo - discardedUpTo);
+}
+
+/*
+ * Like WALRead() in xlogreader.c, but loads from the non-volatile WAL
+ * buffer.
+ */
+bool
+CopyXLogRecordsFromNVWAL(char *buf, Size count, XLogRecPtr startptr)
+{
+	char	   *p;
+	XLogRecPtr	recptr;
+	Size		nbytes;
+
+	Assert(NvwalAvail);
+
+	p = buf;
+	recptr = startptr;
+	nbytes = count;
+
+	/*
+	 * Hold WALBufMappingLock in shared mode to keep others from rotating the
+	 * WAL buffer while we copy WAL records from it.  We do not need an
+	 * exclusive lock because this function does not rotate the buffer.
+	 */
+	LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+	while (nbytes > 0)
+	{
+		char	   *q;
+		Size		off;
+		Size		max_copy;
+		Size		copybytes;
+		XLogRecPtr	discardedUpTo;
+
+		LWLockAcquire(ControlFileLock, LW_SHARED);
+		discardedUpTo = ControlFile->discardedUpTo;
+		LWLockRelease(ControlFileLock);
+
+		/* Check if the records we need have been already evicted or not */
+		if (recptr < discardedUpTo)
+		{
+			LWLockRelease(WALBufMappingLock);
+
+			/* TODO error handling? */
+			return false;
+		}
+
+		/*
+		 * Get the target address in the non-volatile WAL buffer and the size
+		 * we can copy from it at once.  Because the buffer wraps around, we
+		 * might have to split the copy into two or more chunks.
+		 */
+		off = recptr % NvwalSize;
+		q = XLogCtl->pages + off;
+		max_copy = NvwalSize - off;
+		copybytes = Min(nbytes, max_copy);
+
+		memcpy(p, q, copybytes);
+
+		/* Update state for copy */
+		recptr += copybytes;
+		nbytes -= copybytes;
+		p += copybytes;
+	}
+
+	LWLockRelease(WALBufMappingLock);
+	return true;
+}
+
+static bool
+IsXLogSourceFromStream(XLogSource source)
+{
+	switch (source)
+	{
+		case XLOG_FROM_STREAM:
+		case XLOG_FROM_STREAM_NVWAL:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
+static bool
+IsXLogSourceFromNvwal(XLogSource source)
+{
+	switch (source)
+	{
+		case XLOG_FROM_NVWAL:
+		case XLOG_FROM_STREAM_NVWAL:
+			return true;
+
+		default:
+			return false;
+	}
+}
+
+static bool
+NeedsForMoreXLog(XLogRecPtr targetChunkEndPtr)
+{
+	switch (readSource)
+	{
+		case XLOG_FROM_ARCHIVE:
+		case XLOG_FROM_PG_WAL:
+			return (readFile < 0);
+
+		case XLOG_FROM_NVWAL:
+			Assert(NvwalAvail);
+			return false;
+
+		case XLOG_FROM_STREAM:
+			return (flushedUpto < targetChunkEndPtr);
+
+		case XLOG_FROM_STREAM_NVWAL:
+			Assert(NvwalAvail);
+			return (flushedUpto < targetChunkEndPtr);
+
+		default: /* XLOG_FROM_ANY */
+			Assert(readFile < 0);
+			return true;
+	}
+}
+
 /*
  * Read the XLOG page containing RecPtr into readBuf (if not read already).
  * Returns number of bytes read, if the page is read successfully, or -1
@@ -12189,7 +12983,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || IsXLogSourceFromNvwal(readSource)) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -12206,7 +13000,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		if (readFile >= 0)
+			close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
 	}
@@ -12215,9 +13010,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
-		(readSource == XLOG_FROM_STREAM &&
-		 flushedUpto < targetPagePtr + reqLen))
+	if (NeedsForMoreXLog(targetPagePtr + reqLen))
 	{
 		if (!WaitForWALToBecomeAvailable(targetPagePtr + reqLen,
 										 private->randAccess,
@@ -12238,7 +13031,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || IsXLogSourceFromNvwal(readSource));
 
 	/*
 	 * If the current segment is being streamed from the primary, calculate how
@@ -12246,7 +13039,7 @@ retry:
 	 * requested record has been received, but this is for the benefit of
 	 * future calls, to allow quick exit at the top of this function.
 	 */
-	if (readSource == XLOG_FROM_STREAM)
+	if (IsXLogSourceFromStream(readSource))
 	{
 		if (((targetPagePtr) / XLOG_BLCKSZ) != (flushedUpto / XLOG_BLCKSZ))
 			readLen = XLOG_BLCKSZ;
@@ -12257,41 +13050,59 @@ retry:
 	else
 		readLen = XLOG_BLCKSZ;
 
-	/* Read the requested page */
 	readOff = targetPageOff;
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (IsXLogSourceFromNvwal(readSource))
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		Size		offset = (Size) (targetPagePtr % NvwalSize);
+		char	   *readpos = XLogCtl->pages + offset;
 
+		Assert(offset % XLOG_BLCKSZ == 0);
+
+		/* Load the requested page from non-volatile WAL buffer */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		memcpy(readBuf, readpos, readLen);
 		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+
+		/* There is no other clue to the TLI... */
+		xlogreader->seg.ws_tli = ((XLogPageHeader) readBuf)->xlp_tli;
+	}
+	else
+	{
+		/* Read the requested page from file */
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
+		pgstat_report_wait_end();
+
+		xlogreader->seg.ws_tli = curFileTLI;
 	}
-	pgstat_report_wait_end();
 
 	Assert(targetSegNo == readSegNo);
 	Assert(targetPageOff == readOff);
 	Assert(reqLen <= readLen);
 
-	xlogreader->seg.ws_tli = curFileTLI;
-
 	/*
 	 * Check the page header immediately, so that we can retry immediately if
 	 * it's not valid. This may seem unnecessary, because XLogReadRecord()
@@ -12325,6 +13136,17 @@ retry:
 		goto next_record_is_invalid;
 	}
 
+	/*
+	 * Update curFileTLI for each verified page if non-volatile WAL buffer is
+	 * used, because there is no TimeLineID information in NVWAL's filename.
+	 */
+	if (IsXLogSourceFromNvwal(readSource) &&
+		curFileTLI != xlogreader->latestPageTLI)
+	{
+		curFileTLI = xlogreader->latestPageTLI;
+		elog(DEBUG1, "curFileTLI: %u", curFileTLI);
+	}
+
 	return readLen;
 
 next_record_is_invalid:
@@ -12406,7 +13228,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 	if (!InArchiveRecovery)
 		currentSource = XLOG_FROM_PG_WAL;
 	else if (currentSource == XLOG_FROM_ANY ||
-			 (!StandbyMode && currentSource == XLOG_FROM_STREAM))
+			 (!StandbyMode && IsXLogSourceFromStream(currentSource)))
 	{
 		lastSourceFailed = false;
 		currentSource = XLOG_FROM_ARCHIVE;
@@ -12429,6 +13251,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			{
 				case XLOG_FROM_ARCHIVE:
 				case XLOG_FROM_PG_WAL:
+				case XLOG_FROM_NVWAL:
 
 					/*
 					 * Check to see if the trigger file exists. Note that we
@@ -12442,6 +13265,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						return false;
 					}
 
+					/* Try NVWAL if available */
+					if (NvwalAvail && currentSource != XLOG_FROM_NVWAL)
+					{
+						currentSource = XLOG_FROM_NVWAL;
+						break;
+					}
+
 					/*
 					 * Not in standby mode, and we've now tried the archive
 					 * and pg_wal.
@@ -12453,11 +13283,15 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Move to XLOG_FROM_STREAM state, and set to start a
 					 * walreceiver if necessary.
 					 */
-					currentSource = XLOG_FROM_STREAM;
+					if (currentSource == XLOG_FROM_NVWAL)
+						currentSource = XLOG_FROM_STREAM_NVWAL;
+					else
+						currentSource = XLOG_FROM_STREAM;
 					startWalReceiver = true;
 					break;
 
 				case XLOG_FROM_STREAM:
+				case XLOG_FROM_STREAM_NVWAL:
 
 					/*
 					 * Failure while streaming. Most likely, we got here
@@ -12563,6 +13397,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 		{
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
+			case XLOG_FROM_NVWAL:
 
 				/*
 				 * WAL receiver must not be running when reading WAL from
@@ -12580,6 +13415,59 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				if (randAccess)
 					curFileTLI = 0;
 
+				/* Try to load from NVWAL */
+				if (currentSource == XLOG_FROM_NVWAL)
+				{
+					XLogRecPtr		discardedUpTo;
+
+					Assert(NvwalAvail);
+
+					/*
+					 * Check if the target page exists on NVWAL.  Note that
+					 * RecPtr points to the end of the target chunk.
+					 *
+					 * TODO need ControlFileLock?
+					 */
+					discardedUpTo = ControlFile->discardedUpTo;
+					if (discardedUpTo != InvalidXLogRecPtr &&
+						discardedUpTo < RecPtr &&
+						RecPtr <= discardedUpTo + NvwalSize)
+					{
+						/* Report recovery progress in PS display */
+						set_ps_display("recovering NVWAL");
+						elog(DEBUG1, "recovering NVWAL");
+
+						/* Track source of data and receipt time */
+						readSource = XLOG_FROM_NVWAL;
+						XLogReceiptSource = XLOG_FROM_NVWAL;
+						XLogReceiptTime = GetCurrentTimestamp();
+
+						/*
+						 * Construct expectedTLEs.  This is necessary to
+						 * recover only from NVWAL because its filename does
+						 * not have any TLI information.
+						 */
+						if (!expectedTLEs)
+						{
+							TimeLineHistoryEntry	   *entry;
+
+							entry = palloc(sizeof(TimeLineHistoryEntry));
+							entry->tli = recoveryTargetTLI;
+							entry->begin = entry->end = InvalidXLogRecPtr;
+
+							expectedTLEs = list_make1(entry);
+							elog(DEBUG1, "expectedTLEs: [%u]",
+								 (uint32) recoveryTargetTLI);
+						}
+
+						return true;
+					}
+
+					/* Target page does not exist on NVWAL */
+					lastSourceFailed = true;
+					break;
+				}
+
 				/*
 				 * Try to restore the file from archive, or read an existing
 				 * file from pg_wal.
@@ -12597,6 +13485,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				break;
 
 			case XLOG_FROM_STREAM:
+			case XLOG_FROM_STREAM_NVWAL:
 				{
 					bool		havedata;
 
@@ -12721,22 +13610,35 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (currentSource == XLOG_FROM_STREAM_NVWAL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
-							readFile = XLogFileRead(readSegNo, PANIC,
-													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
-						}
-						else
-						{
-							/* just make sure source info is correct... */
-							readSource = XLOG_FROM_STREAM;
-							XLogReceiptSource = XLOG_FROM_STREAM;
+
+							/* TODO is it ok to return, not to break switch? */
+							readSource = XLOG_FROM_STREAM_NVWAL;
+							XLogReceiptSource = XLOG_FROM_STREAM_NVWAL;
 							return true;
 						}
+						else
+						{
+							if (readFile < 0)
+							{
+								if (!expectedTLEs)
+									expectedTLEs = readTimeLineHistory(receiveTLI);
+								readFile = XLogFileRead(readSegNo, PANIC,
+														receiveTLI,
+														XLOG_FROM_STREAM, false);
+								Assert(readFile >= 0);
+							}
+							else
+							{
+								/* just make sure source info is correct... */
+								readSource = XLOG_FROM_STREAM;
+								XLogReceiptSource = XLOG_FROM_STREAM;
+								return true;
+							}
+						}
 						break;
 					}
 
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index bb95e0e527..84107e48b2 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1067,11 +1067,24 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	XLogRecPtr	recptr_nvwal = 0;
+	Size		nbytes_nvwal = 0;
+#endif
 
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
 
+#ifndef FRONTEND
+	/* Try to load records directly from NVWAL if used */
+	if (IsNvwalAvail())
+	{
+		nbytes_nvwal = GetLoadableSizeFromNvwal(startptr, count, &recptr_nvwal);
+		nbytes = count - nbytes_nvwal;
+	}
+#endif
+
 	while (nbytes > 0)
 	{
 		uint32		startoff;
@@ -1139,6 +1152,17 @@ WALRead(XLogReaderState *state,
 		p += readbytes;
 	}
 
+#ifndef FRONTEND
+	if (IsNvwalAvail())
+	{
+		if (!CopyXLogRecordsFromNVWAL(p, nbytes_nvwal, recptr_nvwal))
+		{
+			/* TODO graceful error handling */
+			elog(PANIC, "some records on NVWAL had been discarded");
+		}
+	}
+#endif
+
 	return true;
 }
 
diff --git a/src/bin/pg_controldata/pg_controldata.c b/src/bin/pg_controldata/pg_controldata.c
index 3e00ac0f70..eabcaae2ff 100644
--- a/src/bin/pg_controldata/pg_controldata.c
+++ b/src/bin/pg_controldata/pg_controldata.c
@@ -272,6 +272,9 @@ main(int argc, char *argv[])
 		   ControlFile->checkPointCopy.oldestCommitTsXid);
 	printf(_("Latest checkpoint's newestCommitTsXid:%u\n"),
 		   ControlFile->checkPointCopy.newestCommitTsXid);
+	printf(_("discarded Up To:                      %X/%X\n"),
+		   (uint32) (ControlFile->discardedUpTo >> 32),
+		   (uint32) ControlFile->discardedUpTo);
 	printf(_("Time of latest checkpoint:            %s\n"),
 		   ckpttime_str);
 	printf(_("Fake LSN counter for unlogged rels:   %X/%X\n"),
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1ad6132f67..53f9fef527 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -354,6 +354,14 @@ extern void XLogRequestWalReceiverReply(void);
 extern void assign_max_wal_size(int newval, void *extra);
 extern void assign_checkpoint_completion_target(double newval, void *extra);
 
+extern bool IsNvwalAvail(void);
+extern Size GetLoadableSizeFromNvwal(XLogRecPtr target,
+									 Size count,
+									 XLogRecPtr *nvwalptr);
+extern bool CopyXLogRecordsFromNVWAL(char *buf,
+									 Size count,
+									 XLogRecPtr startptr);
+
 /*
  * Routines to start, stop, and get status of a base backup.
  */
diff --git a/src/include/catalog/pg_control.h b/src/include/catalog/pg_control.h
index e3f48158ce..ac73c9aeb3 100644
--- a/src/include/catalog/pg_control.h
+++ b/src/include/catalog/pg_control.h
@@ -22,7 +22,7 @@
 
 
 /* Version identifier for this pg_control format */
-#define PG_CONTROL_VERSION	1300
+#define PG_CONTROL_VERSION	1301
 
 /* Nonce key length, see below */
 #define MOCK_AUTH_NONCE_LEN		32
@@ -132,6 +132,21 @@ typedef struct ControlFileData
 
 	XLogRecPtr	unloggedLSN;	/* current fake LSN value, for unlogged rels */
 
+	/*
+	 * Used for non-volatile WAL buffer (NVWAL).
+	 *
+	 * discardedUpTo is updated to the oldest LSN in the NVWAL when either a
+	 * checkpoint or a restartpoint completes successfully, or when the whole
+	 * NVWAL is filled with WAL records and a new record is being inserted.
+	 * This field indicates that the NVWAL contains WAL records in the range
+	 * [discardedUpTo, discardedUpTo+S), where S is the size of the NVWAL.
+	 * Note that the WAL records whose LSNs are less than discardedUpTo would
+	 * remain in WAL segment files and be needed for recovery.
+	 *
+	 * It is set to zero when NVWAL is not used.
+	 */
+	XLogRecPtr	discardedUpTo;
+
 	/*
 	 * These two values determine the minimum point we must recover up to
 	 * before starting up:
diff --git a/src/test/regress/expected/misc_functions.out b/src/test/regress/expected/misc_functions.out
index d3acb98d04..bbd47e1663 100644
--- a/src/test/regress/expected/misc_functions.out
+++ b/src/test/regress/expected/misc_functions.out
@@ -142,14 +142,17 @@ HINT:  No function matches the given name and argument types. You might need to
 select setting as segsize
 from pg_settings where name = 'wal_segment_size'
 \gset
-select count(*) > 0 as ok from pg_ls_waldir();
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
  ok 
 ----
  t
 (1 row)
 
 -- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
  ok 
 ----
  t
@@ -161,14 +164,15 @@ select * from pg_ls_waldir() limit 0;
 ------+------+--------------
 (0 rows)
 
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
  ok 
 ----
  t
 (1 row)
 
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+  (select * from pg_ls_waldir() w
+   where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
  ok 
 ----
  t
diff --git a/src/test/regress/sql/misc_functions.sql b/src/test/regress/sql/misc_functions.sql
index 094e8f8296..09c326775d 100644
--- a/src/test/regress/sql/misc_functions.sql
+++ b/src/test/regress/sql/misc_functions.sql
@@ -39,15 +39,19 @@ SELECT num_nulls();
 select setting as segsize
 from pg_settings where name = 'wal_segment_size'
 \gset
+select setting as nvwal_path
+from pg_settings where name = 'nvwal_path'
+\gset
 
-select count(*) > 0 as ok from pg_ls_waldir();
+select count(*) > 0 or :'nvwal_path' <> '' as ok from pg_ls_waldir();
 -- Test ProjectSet as well as FunctionScan
-select count(*) > 0 as ok from (select pg_ls_waldir()) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select pg_ls_waldir()) ss;
 -- Test not-run-to-completion cases.
 select * from pg_ls_waldir() limit 0;
-select count(*) > 0 as ok from (select * from pg_ls_waldir() limit 1) ss;
-select (w).size = :segsize as ok
-from (select pg_ls_waldir() w) ss where length((w).name) = 24 limit 1;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from (select * from pg_ls_waldir() limit 1) ss;
+select count(*) > 0 or :'nvwal_path' <> '' as ok from
+  (select * from pg_ls_waldir() w
+   where length((w).name) = 24 and (w).size = :segsize limit 1) ss;
 
 select count(*) >= 0 as ok from pg_ls_archive_statusdir();
 
-- 
2.25.1
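
Editor's note on the discardedUpTo comment in pg_control.h above: the range it describes, [discardedUpTo, discardedUpTo+S), can be sketched as a small standalone check. This is only an illustration under stated assumptions, not code from the patch set. The names Lsn and nvwal_contains are hypothetical, and the NVWAL size S is passed in as a plain parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr (a 64-bit LSN). */
typedef uint64_t Lsn;

/*
 * Return true if lsn lies in [discarded_up_to, discarded_up_to + nvwal_size).
 * Per the patch comment, discarded_up_to is zero when NVWAL is not used,
 * so in that case nothing is treated as resident in the NVWAL.
 */
static bool
nvwal_contains(Lsn discarded_up_to, uint64_t nvwal_size, Lsn lsn)
{
	if (discarded_up_to == 0)
		return false;

	/* Subtract first so that discarded_up_to + nvwal_size cannot overflow. */
	return lsn >= discarded_up_to && (lsn - discarded_up_to) < nvwal_size;
}
```

For example, with discarded_up_to = 0x1000 and nvwal_size = 0x100, the LSN 0x10FF is inside the range and 0x1100 is not.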

v4-0006-More-log-when-using-NVWAL.patch (application/octet-stream)
From 9d94f704e91619236f9c15d5878748b134ee25c3 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 19 Oct 2020 16:55:12 +0900
Subject: [PATCH v4 6/6] More log when using NVWAL

---
 src/backend/access/transam/xlog.c | 38 ++++++++++++++++++-------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a3caf85f1f..a6b18aa38a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2202,14 +2202,6 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	int			npages = 0;
 	bool		is_firstpage;
 
-	if (NvwalAvail)
-		elog(DEBUG1, "XLogCtl->InitializedUpTo %X/%X; upto %X/%X; opportunistic %s",
-			 (uint32) (XLogCtl->InitializedUpTo >> 32),
-			 (uint32) XLogCtl->InitializedUpTo,
-			 (uint32) (upto >> 32),
-			 (uint32) upto,
-			 opportunistic ? "true" : "false");
-
 	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
 
 	/*
@@ -2277,7 +2269,8 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 						 * If we use non-volatile WAL buffer, it is a special
 						 * but expected case to write the buffer pages out to
 						 * segment files, and for simplicity, it is done in
-						 * segment by segment.
+						 * segment by segment. Note that this output would
+						 * cause performance degradation, so we log it later.
 						 */
 						XLogRecPtr		OldSegEndPtr;
 
@@ -2294,6 +2287,14 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 					LWLockRelease(WALWriteLock);
 					WalStats.m_wal_buffers_full++;
 					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
+
+					/* Out of critical section, so it's time to log */
+					if (NvwalAvail)
+					{
+						elog(WARNING, "old segment written to file: up to %X/%X",
+							 (uint32) (WriteRqst.Write >> 32),
+							 (uint32) WriteRqst.Write);
+					}
 				}
 				/* Re-acquire WALBufMappingLock and retry */
 				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
@@ -2393,13 +2394,6 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
 	}
 	LWLockRelease(WALBufMappingLock);
 
-	if (NvwalAvail)
-		elog(DEBUG1, "ControlFile->discardedUpTo %X/%X; XLogCtl->InitializedUpTo %X/%X",
-			 (uint32) (ControlFile->discardedUpTo >> 32),
-			 (uint32) ControlFile->discardedUpTo,
-			 (uint32) (XLogCtl->InitializedUpTo >> 32),
-			 (uint32) XLogCtl->InitializedUpTo);
-
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG && npages > 0)
 	{
@@ -4228,11 +4222,23 @@ PreallocNonVolatileXlogBuffer(void)
 	if (newupto <= InitializedUpTo)
 		return;
 
+	/*
+	 * Log that we are starting to preallocate. Yes, we know that we are
+	 * still in a critical section of the checkpoint, but we log it because
+	 * preallocating might degrade performance.
+	 */
+	elog(NOTICE, "preallocate starting: up to %X/%X",
+		 (uint32) (newupto >> 32), (uint32) newupto);
+
 	/*
 	 * Subtracting XLOG_BLCKSZ is important, because AdvanceXLInsertBuffer
 	 * handles the first argument as the beginning of pages, not the end.
 	 */
 	AdvanceXLInsertBuffer(newupto - XLOG_BLCKSZ, false);
+
+	/* Log that preallocation is complete */
+	elog(NOTICE, "preallocate complete: up to %X/%X",
+		 (uint32) (newupto >> 32), (uint32) newupto);
 }
 
 /*
-- 
2.25.1
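
Editor's note on the log messages above: each elog call prints an LSN as two 32-bit halves with "%X/%X". The standalone sketch below shows the same split outside PostgreSQL. The names Lsn and format_lsn are hypothetical and are not part of the patch.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for PostgreSQL's XLogRecPtr (a 64-bit LSN). */
typedef uint64_t Lsn;

/*
 * Format lsn as "HIGH/LOW" in uppercase hex, the same layout the patch
 * produces with (uint32) (lsn >> 32) and (uint32) lsn.
 * Returns the value of snprintf(), i.e. the length of the full text.
 */
static int
format_lsn(char *buf, size_t buflen, Lsn lsn)
{
	assert(buf != NULL && buflen > 0);

	return snprintf(buf, buflen, "%" PRIX32 "/%" PRIX32,
					(uint32_t) (lsn >> 32), (uint32_t) lsn);
}
```

A caller would pass a buffer of at least 18 bytes, which is enough for two 8-digit halves, the slash, and the terminating NUL.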

v3-0001-Revert-Use-vectored-I-O-to-fill-new-WAL-segments.patch (application/octet-stream)
From a28a31af7a679e976319f531340f806033be2a29 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 19 Jan 2021 17:09:19 +0900
Subject: [PATCH v3 01/10] Revert "Use vectored I/O to fill new WAL segments."

This reverts commit ce6a71fa5300cf00adf32c9daee302c523609709.
---
 src/backend/access/transam/xlog.c | 28 ++++++----------------------
 1 file changed, 6 insertions(+), 22 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..43fe60405e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -48,7 +48,6 @@
 #include "pg_trace.h"
 #include "pgstat.h"
 #include "port/atomics.h"
-#include "port/pg_iovec.h"
 #include "postmaster/bgwriter.h"
 #include "postmaster/startup.h"
 #include "postmaster/walwriter.h"
@@ -3272,6 +3271,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	installed_segno;
 	XLogSegNo	max_segno;
 	int			fd;
+	int			nbytes;
 	int			save_errno;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
@@ -3318,9 +3318,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
-
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
 		 * ensure that all the file space has really been allocated.  On
@@ -3330,28 +3327,15 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
-
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
+		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
-		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
-		{
-			int 		iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
-
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
+			errno = 0;
+			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 			{
-				save_errno = errno;
+				/* if write didn't set errno, assume no disk space */
+				save_errno = errno ? errno : ENOSPC;
 				break;
 			}
-
-			i += iovcnt;
 		}
 	}
 	else
-- 
2.25.1
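
Editor's note on the revert above: the restored loop zero-fills the segment one block per write(2) call and treats a short write that leaves errno unset as "no disk space". A minimal standalone sketch of that pattern follows. The names BLOCK_SIZE and zero_fill_fd are hypothetical, and BLOCK_SIZE merely stands in for XLOG_BLCKSZ with an assumed value of 8192.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#define BLOCK_SIZE 8192			/* assumed; stands in for XLOG_BLCKSZ */

/*
 * Zero-fill total_size bytes at the current offset of fd, one block per
 * write(2) call.  Returns 0 on success or an errno value on failure.
 */
static int
zero_fill_fd(int fd, size_t total_size)
{
	static const char zeros[BLOCK_SIZE];	/* static storage is zeroed */
	size_t		nbytes;

	assert(total_size % BLOCK_SIZE == 0);

	for (nbytes = 0; nbytes < total_size; nbytes += BLOCK_SIZE)
	{
		errno = 0;
		if (write(fd, zeros, BLOCK_SIZE) != BLOCK_SIZE)
		{
			/* If write(2) did not set errno, assume no disk space. */
			return errno ? errno : ENOSPC;
		}
	}
	return 0;
}
```

Compared with the vectored version being reverted, this issues one system call per block, which is simpler but makes more calls for the same segment size.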

v3-0004-Lazy-unmap-WAL-segments.patch (application/octet-stream)
From cb6ecaa3a2d6d65dc6f95743fdb1d16c457927ea Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:02 +0900
Subject: [PATCH v3 04/10] Lazy-unmap WAL segments

---
 src/backend/access/transam/xlog.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a20fadbb55..7d9d2dc06a 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -798,7 +798,9 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static XLogSegNo beingClosedLogSegNo = 0;
 static char *mappedPages = NULL;
+static char *beingUnmappedPages = NULL;
 static bool pmemMapped = 0;
 
 /* 2MiB hugepage mask used by XLogFileMapHint */
@@ -1215,6 +1217,14 @@ XLogInsertRecord(XLogRecData *rdata,
 		}
 	}
 
+	/* Lazy-unmap */
+	if (beingUnmappedPages != NULL)
+	{
+		XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+		beingUnmappedPages = NULL;
+		beingClosedLogSegNo = 0;
+	}
+
 #ifdef WAL_DEBUG
 	if (XLOG_DEBUG)
 	{
@@ -1857,9 +1867,23 @@ GetXLogBuffer(XLogRecPtr ptr)
 	XLByteToSeg(ptr, segno, wal_segment_size);
 	if (segno != openLogSegNo)
 	{
-		/* Unmap the current segment if mapped */
+		/*
+		 * We do not want to unmap the current segment here because we are in
+		 * a critical section and unmapping is a time-consuming operation.  So
+		 * we just mark it to be unmapped later.
+		 */
 		if (mappedPages != NULL)
-			XLogFileUnmap(mappedPages, openLogSegNo);
+		{
+			/*
+			 * If there is another being-unmapped segment, it cannot be helped;
+			 * we unmap it here.
+			 */
+			if (beingUnmappedPages != NULL)
+				XLogFileUnmap(beingUnmappedPages, beingClosedLogSegNo);
+
+			beingUnmappedPages = mappedPages;
+			beingClosedLogSegNo = openLogSegNo;
+		}
 
 		/* Map the segment we need */
 		mappedPages = XLogFileMap(segno, &pmemMapped);
-- 
2.25.1
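
Editor's note on the lazy-unmap patch above: the idea is to move munmap(2) out of the hot path by parking the old mapping in a "being unmapped" slot and releasing it later. The sketch below models only that bookkeeping with anonymous mappings. SegMapState, map_anon, switch_segment and lazy_unmap are hypothetical names, and the real patch maps WAL segment files rather than anonymous memory.

```c
#define _DEFAULT_SOURCE			/* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

typedef struct SegMapState
{
	void	   *mapped;			/* currently mapped segment, or NULL */
	void	   *being_unmapped;	/* segment awaiting deferred unmap, or NULL */
	size_t		seg_size;		/* size of every mapping */
	int			unmap_calls;	/* number of munmap(2) calls, for illustration */
} SegMapState;

/* Map seg_size bytes of anonymous memory; returns NULL on failure. */
static void *
map_anon(size_t seg_size)
{
	void	   *p = mmap(NULL, seg_size, PROT_READ | PROT_WRITE,
						 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

static void
do_unmap(SegMapState *st, void *addr)
{
	int			rc = munmap(addr, st->seg_size);

	assert(rc == 0);
	(void) rc;
	st->unmap_calls++;
}

/* Hot path: install a new mapping and defer the unmap of the old one. */
static void
switch_segment(SegMapState *st, void *new_mapping)
{
	if (st->mapped != NULL)
	{
		/* A segment is already pending, so it has to be unmapped now. */
		if (st->being_unmapped != NULL)
			do_unmap(st, st->being_unmapped);
		st->being_unmapped = st->mapped;
	}
	st->mapped = new_mapping;
}

/* Cold path: release the pending segment, if any. */
static void
lazy_unmap(SegMapState *st)
{
	if (st->being_unmapped != NULL)
	{
		do_unmap(st, st->being_unmapped);
		st->being_unmapped = NULL;
	}
}
```

In this model the caller runs switch_segment while holding locks and lazy_unmap afterwards, which mirrors how the patch defers XLogFileUnmap to the end of XLogInsertRecord.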

v3-0002-Preallocate-more-WAL-segments.patch (application/octet-stream)
From a48695251e145e2691a1217d971ab7e9bbcb6de3 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:13:59 +0900
Subject: [PATCH v3 02/10] Preallocate more WAL segments

---
 src/backend/access/transam/xlog.c | 27 ++++++++++-----------------
 1 file changed, 10 insertions(+), 17 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 43fe60405e..5bf79e1d8c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -926,7 +926,7 @@ static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 										bool fetching_ckpt, XLogRecPtr tliRecPtr);
 static int	emode_for_corrupt_record(int emode, XLogRecPtr RecPtr);
 static void XLogFileClose(void);
-static void PreallocXlogFiles(XLogRecPtr endptr);
+static void PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr);
 static void RemoveTempXlogFiles(void);
 static void RemoveOldXlogFiles(XLogSegNo segno, XLogRecPtr lastredoptr, XLogRecPtr endptr);
 static void RemoveXlogFile(const char *segname, XLogSegNo recycleSegNo,
@@ -3895,27 +3895,20 @@ XLogFileClose(void)
 
 /*
  * Preallocate log files beyond the specified log endpoint.
- *
- * XXX this is currently extremely conservative, since it forces only one
- * future log segment to exist, and even that only if we are 75% done with
- * the current one.  This is only appropriate for very low-WAL-volume systems.
- * High-volume systems will be OK once they've built up a sufficient set of
- * recycled log segments, but the startup transient is likely to include
- * a lot of segment creations by foreground processes, which is not so good.
  */
 static void
-PreallocXlogFiles(XLogRecPtr endptr)
+PreallocXlogFiles(XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 {
 	XLogSegNo	_logSegNo;
+	XLogSegNo	endSegNo;
+	XLogSegNo	recycleSegNo;
 	int			lf;
 	bool		use_existent;
-	uint64		offset;
 
-	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
-	offset = XLogSegmentOffset(endptr - 1, wal_segment_size);
-	if (offset >= (uint32) (0.75 * wal_segment_size))
+	XLByteToPrevSeg(endptr, endSegNo, wal_segment_size);
+	recycleSegNo = XLOGfileslop(RedoRecPtr);
+	for (_logSegNo = endSegNo + 1; _logSegNo <= recycleSegNo; _logSegNo++)
 	{
-		_logSegNo++;
 		use_existent = true;
 		lf = XLogFileInit(_logSegNo, &use_existent, true);
 		close(lf);
@@ -7915,7 +7908,7 @@ StartupXLOG(void)
 	/*
 	 * Preallocate additional log files, if wanted.
 	 */
-	PreallocXlogFiles(EndOfLog);
+	PreallocXlogFiles(RedoRecPtr, EndOfLog);
 
 	/*
 	 * Okay, we're officially UP.
@@ -9202,7 +9195,7 @@ CreateCheckPoint(int flags)
 	 * segments, since that may supply some of the needed files.)
 	 */
 	if (!shutdown)
-		PreallocXlogFiles(recptr);
+		PreallocXlogFiles(RedoRecPtr, recptr);
 
 	/*
 	 * Truncate pg_subtrans if possible.  We can throw away all data before
@@ -9571,7 +9564,7 @@ CreateRestartPoint(int flags)
 	 * Make more log segments if needed.  (Do this after recycling old log
 	 * segments, since that may supply some of the needed files.)
 	 */
-	PreallocXlogFiles(endptr);
+	PreallocXlogFiles(RedoRecPtr, endptr);
 
 	/*
 	 * ThisTimeLineID is normally not set when we're still in recovery.
-- 
2.25.1
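
Editor's note on the preallocation patch above: the new loop creates every segment from the one after the segment holding endptr up to recycleSegNo. The sketch below shows only that range arithmetic. Lsn, SegNo, lsn_to_prev_seg and count_prealloc_segments are hypothetical names, and recycle_segno is taken as an input because the XLOGfileslop() calculation is outside the scope of this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for XLogRecPtr and XLogSegNo. */
typedef uint64_t Lsn;
typedef uint64_t SegNo;

/* Segment containing the byte just before ptr, like XLByteToPrevSeg. */
static SegNo
lsn_to_prev_seg(Lsn ptr, uint64_t seg_size)
{
	assert(ptr > 0 && seg_size > 0);
	return (ptr - 1) / seg_size;
}

/*
 * Count the segments the loop would visit: end_segno + 1 .. recycle_segno,
 * inclusive.  Returns 0 when recycle_segno does not exceed end_segno.
 */
static uint64_t
count_prealloc_segments(Lsn endptr, SegNo recycle_segno, uint64_t seg_size)
{
	SegNo		end_segno = lsn_to_prev_seg(endptr, seg_size);
	SegNo		segno;
	uint64_t	n = 0;

	for (segno = end_segno + 1; segno <= recycle_segno; segno++)
		n++;					/* the patch calls XLogFileInit() here */

	return n;
}
```

With 16 MiB segments, an endptr of 0x3000000 falls in segment 2, so a recycle_segno of 5 gives segments 3, 4 and 5.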

v3-0005-Speculative-map-WAL-segments.patch (application/octet-stream)
From 990c33ec45c5b8080f89414102ddd8a725f206c3 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:03 +0900
Subject: [PATCH v3 05/10] Speculative-map WAL segments

---
 src/backend/access/transam/xlog.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 7d9d2dc06a..825de800b7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1029,6 +1029,8 @@ XLogInsertRecord(XLogRecData *rdata,
 							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
+	XLogRecPtr	ProbablyInsertPos;
+	XLogSegNo	ProbablyInsertSegNo;
 	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
@@ -1038,6 +1040,23 @@ XLogInsertRecord(XLogRecData *rdata,
 	if (!XLogInsertAllowed())
 		elog(ERROR, "cannot make new WAL entries during recovery");
 
+	/* Speculatively map a segment we probably need */
+	ProbablyInsertPos = GetInsertRecPtr();
+	XLByteToSeg(ProbablyInsertPos, ProbablyInsertSegNo, wal_segment_size);
+	if (ProbablyInsertSegNo != openLogSegNo)
+	{
+		if (mappedPages != NULL)
+		{
+			Assert(beingUnmappedPages == NULL);
+			Assert(beingClosedLogSegNo == 0);
+			beingUnmappedPages = mappedPages;
+			beingClosedLogSegNo = openLogSegNo;
+		}
+		mappedPages = XLogFileMap(ProbablyInsertSegNo, &pmemMapped);
+		Assert(mappedPages != NULL);
+		openLogSegNo = ProbablyInsertSegNo;
+	}
+
 	/*----------
 	 *
 	 * We have now done all the preparatory work we can without holding a
-- 
2.25.1
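
Editor's note on the speculative-map patch above: the decision to remap rests on converting the probable insert position to a segment number and comparing it with the open segment. The sketch below shows that arithmetic together with the in-segment offset that a mapped segment would be indexed by. Lsn, SegNo, lsn_to_seg, lsn_seg_offset and needs_remap are hypothetical names, and the segment size is assumed to be nonzero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for XLogRecPtr and XLogSegNo. */
typedef uint64_t Lsn;
typedef uint64_t SegNo;

/* Segment containing ptr, like XLByteToSeg. */
static SegNo
lsn_to_seg(Lsn ptr, uint64_t seg_size)
{
	assert(seg_size > 0);
	return ptr / seg_size;
}

/* Byte offset of ptr within its segment. */
static uint64_t
lsn_seg_offset(Lsn ptr, uint64_t seg_size)
{
	assert(seg_size > 0);
	return ptr % seg_size;
}

/* True if the probable insert position lies outside the open segment. */
static bool
needs_remap(Lsn probable_insert_pos, SegNo open_segno, uint64_t seg_size)
{
	return lsn_to_seg(probable_insert_pos, seg_size) != open_segno;
}
```

The check is speculative because the position is read before the insert position is actually reserved, so the record may still land in a different segment.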

v3-0003-Use-WAL-segments-as-WAL-buffers.patch (application/octet-stream)
From 67d7b1d16d8b2710e38d2b094ac9dc27acbfed40 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:00 +0900
Subject: [PATCH v3 03/10] Use WAL segments as WAL buffers

Please run ./configure with LIBS=-lpmem to build.

Note that we ignore wal_sync_method from here.
---
 src/backend/access/transam/xlog.c | 968 +++++++++++-------------------
 1 file changed, 366 insertions(+), 602 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5bf79e1d8c..a20fadbb55 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -18,9 +18,11 @@
 #include <math.h>
 #include <time.h>
 #include <fcntl.h>
+#include <sys/mman.h>
 #include <sys/stat.h>
 #include <sys/time.h>
 #include <unistd.h>
+#include <libpmem.h>
 
 #include "access/clog.h"
 #include "access/commit_ts.h"
@@ -623,24 +625,8 @@ typedef struct XLogCtlData
 	XLogwrtResult LogwrtResult;
 
 	/*
-	 * Latest initialized page in the cache (last byte position + 1).
-	 *
-	 * To change the identity of a buffer (and InitializedUpTo), you need to
-	 * hold WALBufMappingLock.  To change the identity of a buffer that's
-	 * still dirty, the old page needs to be written out first, and for that
-	 * you need WALWriteLock, and you need to ensure that there are no
-	 * in-progress insertions to the page by calling
-	 * WaitXLogInsertionsToFinish().
+	 * This value does not change after startup.
 	 */
-	XLogRecPtr	InitializedUpTo;
-
-	/*
-	 * These values do not change after startup, although the pointed-to pages
-	 * and xlblocks values certainly do.  xlblocks values are protected by
-	 * WALBufMappingLock.
-	 */
-	char	   *pages;			/* buffers for unwritten XLOG pages */
-	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
 	/*
@@ -804,9 +790,26 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  * openLogSegNo identifies the segment.  These variables are only used to
  * write the XLOG, and so will normally refer to the active segment.
  * Note: call Reserve/ReleaseExternalFD to track consumption of this FD.
+ *
+ * mappedPages is the mmap(2)-ed address of an open log file segment.
+ * It is used as the WAL buffer instead of XLogCtl->pages.
+ *
+ * pmemMapped is true if mappedPages is on PMEM.
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static char *mappedPages = NULL;
+static bool pmemMapped = 0;
+
+/* 2MiB hugepage mask used by XLogFileMapHint */
+#define PG_HUGEPAGE_MASK ((((uintptr_t) 1) << 21) - 1)
+
+#ifndef MAP_SHARED_VALIDATE
+#define MAP_SHARED_VALIDATE 0x3
+#endif
+#ifndef MAP_SYNC
+#define MAP_SYNC 0x80000
+#endif
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -911,12 +914,15 @@ static void CheckPointGuts(XLogRecPtr checkPointRedo, int flags);
 static void KeepLogSeg(XLogRecPtr recptr, XLogSegNo *logSegNo);
 static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
-static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 								   bool find_free, XLogSegNo max_segno,
 								   bool use_lock);
+static void *XLogFileMapHint(void);
+static void *XLogFileMapUtil(void *hint, int fd, bool dax);
+static char *XLogFileMap(XLogSegNo segno, bool *is_pmem);
+static void XLogFileUnmap(char *pages, XLogSegNo segno);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 						 XLogSource source, bool notfoundOk);
 static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
@@ -979,7 +985,6 @@ static void checkXLogConsistency(XLogReaderState *record);
 static void WALInsertLockAcquire(void);
 static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
-static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
@@ -1623,27 +1628,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
 		 */
 		while (CurrPos < EndPos)
 		{
-			/*
-			 * The minimal action to flush the page would be to call
-			 * WALInsertLockUpdateInsertingAt(CurrPos) followed by
-			 * AdvanceXLInsertBuffer(...).  The page would be left initialized
-			 * mostly to zeros, except for the page header (always the short
-			 * variant, as this is never a segment's first page).
-			 *
-			 * The large vistas of zeros are good for compressibility, but the
-			 * headers interrupting them every XLOG_BLCKSZ (with values that
-			 * differ from page to page) are not.  The effect varies with
-			 * compression tool, but bzip2 for instance compresses about an
-			 * order of magnitude worse if those headers are left in place.
-			 *
-			 * Rather than complicating AdvanceXLInsertBuffer itself (which is
-			 * called in heavily-loaded circumstances as well as this lightly-
-			 * loaded one) with variant behavior, we just use GetXLogBuffer
-			 * (which itself calls the two methods we need) to get the pointer
-			 * and zero most of the page.  Then we just zero the page header.
-			 */
-			currpos = GetXLogBuffer(CurrPos);
-			MemSet(currpos, 0, SizeOfXLogShortPHD);
+			/* XXX We assume that XLogFileInit does what we did here */
 
 			CurrPos += XLOG_BLCKSZ;
 		}
@@ -1757,29 +1742,6 @@ WALInsertLockRelease(void)
 	}
 }
 
-/*
- * Update our insertingAt value, to let others know that we've finished
- * inserting up to that point.
- */
-static void
-WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
-{
-	if (holdingAllLocks)
-	{
-		/*
-		 * We use the last lock to mark our actual position, see comments in
-		 * WALInsertLockAcquireExclusive.
-		 */
-		LWLockUpdateVar(&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.lock,
-						&WALInsertLocks[NUM_XLOGINSERT_LOCKS - 1].l.insertingAt,
-						insertingAt);
-	}
-	else
-		LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
-						&WALInsertLocks[MyLockNo].l.insertingAt,
-						insertingAt);
-}
-
 /*
  * Wait for any WAL insertions < upto to finish.
  *
@@ -1881,123 +1843,37 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 /*
  * Get a pointer to the right location in the WAL buffer containing the
  * given XLogRecPtr.
- *
- * If the page is not initialized yet, it is initialized. That might require
- * evicting an old dirty buffer from the buffer cache, which means I/O.
- *
- * The caller must ensure that the page containing the requested location
- * isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto a WAL insertion lock with the insertingAt position set to
- * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
- * to evict an old page from the buffer. (This means that once you call
- * GetXLogBuffer() with a given 'ptr', you must not access anything before
- * that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
- * later, because older buffers might be recycled already)
  */
 static char *
 GetXLogBuffer(XLogRecPtr ptr)
 {
-	int			idx;
-	XLogRecPtr	endptr;
-	static uint64 cachedPage = 0;
-	static char *cachedPos = NULL;
-	XLogRecPtr	expectedEndPtr;
+	int				idx;
+	XLogPageHeader	page;
+	XLogSegNo		segno;
 
-	/*
-	 * Fast path for the common case that we need to access again the same
-	 * page as last time.
-	 */
-	if (ptr / XLOG_BLCKSZ == cachedPage)
+	/* Silence the compiler when built without --enable-cassert */
+	(void) page;
+
+	XLByteToSeg(ptr, segno, wal_segment_size);
+	if (segno != openLogSegNo)
 	{
-		Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
-		Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-		return cachedPos + ptr % XLOG_BLCKSZ;
+		/* Unmap the current segment if mapped */
+		if (mappedPages != NULL)
+			XLogFileUnmap(mappedPages, openLogSegNo);
+
+		/* Map the segment we need */
+		mappedPages = XLogFileMap(segno, &pmemMapped);
+		Assert(mappedPages != NULL);
+		openLogSegNo = segno;
 	}
 
-	/*
-	 * The XLog buffer cache is organized so that a page is always loaded to a
-	 * particular buffer.  That way we can easily calculate the buffer a given
-	 * page must be loaded into, from the XLogRecPtr alone.
-	 */
 	idx = XLogRecPtrToBufIdx(ptr);
+	page = (XLogPageHeader) (mappedPages + idx * (Size) XLOG_BLCKSZ);
 
-	/*
-	 * See what page is loaded in the buffer at the moment. It could be the
-	 * page we're looking for, or something older. It can't be anything newer
-	 * - that would imply the page we're looking for has already been written
-	 * out to disk and evicted, and the caller is responsible for making sure
-	 * that doesn't happen.
-	 *
-	 * However, we don't hold a lock while we read the value. If someone has
-	 * just initialized the page, it's possible that we get a "torn read" of
-	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
-	 * that case we will see a bogus value. That's ok, we'll grab the mapping
-	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
-	 * the page we're looking for. But it means that when we do this unlocked
-	 * read, we might see a value that appears to be ahead of the page we're
-	 * looking for. Don't PANIC on that, until we've verified the value while
-	 * holding the lock.
-	 */
-	expectedEndPtr = ptr;
-	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+	Assert(page->xlp_magic == XLOG_PAGE_MAGIC);
+	Assert(page->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
 
-	endptr = XLogCtl->xlblocks[idx];
-	if (expectedEndPtr != endptr)
-	{
-		XLogRecPtr	initializedUpto;
-
-		/*
-		 * Before calling AdvanceXLInsertBuffer(), which can block, let others
-		 * know how far we're finished with inserting the record.
-		 *
-		 * NB: If 'ptr' points to just after the page header, advertise a
-		 * position at the beginning of the page rather than 'ptr' itself. If
-		 * there are no other insertions running, someone might try to flush
-		 * up to our advertised location. If we advertised a position after
-		 * the page header, someone might try to flush the page header, even
-		 * though page might actually not be initialized yet. As the first
-		 * inserter on the page, we are effectively responsible for making
-		 * sure that it's initialized, before we let insertingAt to move past
-		 * the page header.
-		 */
-		if (ptr % XLOG_BLCKSZ == SizeOfXLogShortPHD &&
-			XLogSegmentOffset(ptr, wal_segment_size) > XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogShortPHD;
-		else if (ptr % XLOG_BLCKSZ == SizeOfXLogLongPHD &&
-				 XLogSegmentOffset(ptr, wal_segment_size) < XLOG_BLCKSZ)
-			initializedUpto = ptr - SizeOfXLogLongPHD;
-		else
-			initializedUpto = ptr;
-
-		WALInsertLockUpdateInsertingAt(initializedUpto);
-
-		AdvanceXLInsertBuffer(ptr, false);
-		endptr = XLogCtl->xlblocks[idx];
-
-		if (expectedEndPtr != endptr)
-			elog(PANIC, "could not find WAL buffer for %X/%X",
-				 (uint32) (ptr >> 32), (uint32) ptr);
-	}
-	else
-	{
-		/*
-		 * Make sure the initialization of the page is visible to us, and
-		 * won't arrive later to overwrite the WAL data we write on the page.
-		 */
-		pg_memory_barrier();
-	}
-
-	/*
-	 * Found the buffer holding this page. Return a pointer to the right
-	 * offset within the page.
-	 */
-	cachedPage = ptr / XLOG_BLCKSZ;
-	cachedPos = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
-
-	Assert(((XLogPageHeader) cachedPos)->xlp_magic == XLOG_PAGE_MAGIC);
-	Assert(((XLogPageHeader) cachedPos)->xlp_pageaddr == ptr - (ptr % XLOG_BLCKSZ));
-
-	return cachedPos + ptr % XLOG_BLCKSZ;
+	return mappedPages + ptr % wal_segment_size;
 }
 
 /*
@@ -2125,179 +2001,6 @@ XLogRecPtrToBytePos(XLogRecPtr ptr)
 	return result;
 }
 
-/*
- * Initialize XLOG buffers, writing out old buffers if they still contain
- * unwritten data, upto the page containing 'upto'. Or if 'opportunistic' is
- * true, initialize as many pages as we can without having to write out
- * unwritten data. Any new pages are initialized to zeros, with pages headers
- * initialized properly.
- */
-static void
-AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic)
-{
-	XLogCtlInsert *Insert = &XLogCtl->Insert;
-	int			nextidx;
-	XLogRecPtr	OldPageRqstPtr;
-	XLogwrtRqst WriteRqst;
-	XLogRecPtr	NewPageEndPtr = InvalidXLogRecPtr;
-	XLogRecPtr	NewPageBeginPtr;
-	XLogPageHeader NewPage;
-	int			npages = 0;
-
-	LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-
-	/*
-	 * Now that we have the lock, check if someone initialized the page
-	 * already.
-	 */
-	while (upto >= XLogCtl->InitializedUpTo || opportunistic)
-	{
-		nextidx = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo);
-
-		/*
-		 * Get ending-offset of the buffer page we need to replace (this may
-		 * be zero if the buffer hasn't been used yet).  Fall through if it's
-		 * already written out.
-		 */
-		OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
-		if (LogwrtResult.Write < OldPageRqstPtr)
-		{
-			/*
-			 * Nope, got work to do. If we just want to pre-initialize as much
-			 * as we can without flushing, give up now.
-			 */
-			if (opportunistic)
-				break;
-
-			/* Before waiting, get info_lck and update LogwrtResult */
-			SpinLockAcquire(&XLogCtl->info_lck);
-			if (XLogCtl->LogwrtRqst.Write < OldPageRqstPtr)
-				XLogCtl->LogwrtRqst.Write = OldPageRqstPtr;
-			LogwrtResult = XLogCtl->LogwrtResult;
-			SpinLockRelease(&XLogCtl->info_lck);
-
-			/*
-			 * Now that we have an up-to-date LogwrtResult value, see if we
-			 * still need to write it or if someone else already did.
-			 */
-			if (LogwrtResult.Write < OldPageRqstPtr)
-			{
-				/*
-				 * Must acquire write lock. Release WALBufMappingLock first,
-				 * to make sure that all insertions that we need to wait for
-				 * can finish (up to this same position). Otherwise we risk
-				 * deadlock.
-				 */
-				LWLockRelease(WALBufMappingLock);
-
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
-
-				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
-
-				LogwrtResult = XLogCtl->LogwrtResult;
-				if (LogwrtResult.Write >= OldPageRqstPtr)
-				{
-					/* OK, someone wrote it already */
-					LWLockRelease(WALWriteLock);
-				}
-				else
-				{
-					/* Have to write it ourselves */
-					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_START();
-					WriteRqst.Write = OldPageRqstPtr;
-					WriteRqst.Flush = 0;
-					XLogWrite(WriteRqst, false);
-					LWLockRelease(WALWriteLock);
-					WalStats.m_wal_buffers_full++;
-					TRACE_POSTGRESQL_WAL_BUFFER_WRITE_DIRTY_DONE();
-				}
-				/* Re-acquire WALBufMappingLock and retry */
-				LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
-				continue;
-			}
-		}
-
-		/*
-		 * Now the next buffer slot is free and we can set it up to be the
-		 * next output page.
-		 */
-		NewPageBeginPtr = XLogCtl->InitializedUpTo;
-		NewPageEndPtr = NewPageBeginPtr + XLOG_BLCKSZ;
-
-		Assert(XLogRecPtrToBufIdx(NewPageBeginPtr) == nextidx);
-
-		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
-
-		/*
-		 * Be sure to re-zero the buffer so that bytes beyond what we've
-		 * written will look like zeroes and not valid XLOG records...
-		 */
-		MemSet((char *) NewPage, 0, XLOG_BLCKSZ);
-
-		/*
-		 * Fill the new page's header
-		 */
-		NewPage->xlp_magic = XLOG_PAGE_MAGIC;
-
-		/* NewPage->xlp_info = 0; */	/* done by memset */
-		NewPage->xlp_tli = ThisTimeLineID;
-		NewPage->xlp_pageaddr = NewPageBeginPtr;
-
-		/* NewPage->xlp_rem_len = 0; */	/* done by memset */
-
-		/*
-		 * If online backup is not in progress, mark the header to indicate
-		 * that WAL records beginning in this page have removable backup
-		 * blocks.  This allows the WAL archiver to know whether it is safe to
-		 * compress archived WAL data by transforming full-block records into
-		 * the non-full-block format.  It is sufficient to record this at the
-		 * page level because we force a page switch (in fact a segment
-		 * switch) when starting a backup, so the flag will be off before any
-		 * records can be written during the backup.  At the end of a backup,
-		 * the last page will be marked as all unsafe when perhaps only part
-		 * is unsafe, but at worst the archiver would miss the opportunity to
-		 * compress a few records.
-		 */
-		if (!Insert->forcePageWrites)
-			NewPage->xlp_info |= XLP_BKP_REMOVABLE;
-
-		/*
-		 * If first page of an XLOG segment file, make it a long header.
-		 */
-		if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
-		{
-			XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
-
-			NewLongPage->xlp_sysid = ControlFile->system_identifier;
-			NewLongPage->xlp_seg_size = wal_segment_size;
-			NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
-			NewPage->xlp_info |= XLP_LONG_HEADER;
-		}
-
-		/*
-		 * Make sure the initialization of the page becomes visible to others
-		 * before the xlblocks update. GetXLogBuffer() reads xlblocks without
-		 * holding a lock.
-		 */
-		pg_write_barrier();
-
-		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
-		XLogCtl->InitializedUpTo = NewPageEndPtr;
-
-		npages++;
-	}
-	LWLockRelease(WALBufMappingLock);
-
-#ifdef WAL_DEBUG
-	if (XLOG_DEBUG && npages > 0)
-	{
-		elog(DEBUG1, "initialized %d pages, up to %X/%X",
-			 npages, (uint32) (NewPageEndPtr >> 32), (uint32) NewPageEndPtr);
-	}
-#endif
-}
-
 /*
  * Calculate CheckPointSegments based on max_wal_size_mb and
  * checkpoint_completion_target.
@@ -2426,14 +2129,9 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 static void
 XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 {
-	bool		ispartialpage;
-	bool		last_iteration;
 	bool		finishing_seg;
-	bool		use_existent;
-	int			curridx;
-	int			npages;
-	int			startidx;
-	uint32		startoffset;
+	XLogSegNo	rqstLogSegNo;
+	XLogSegNo	segno;
 
 	/* We should always be inside a critical section here */
 	Assert(CritSectionCount > 0);
@@ -2443,233 +2141,149 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 	 */
 	LogwrtResult = XLogCtl->LogwrtResult;
 
-	/*
-	 * Since successive pages in the xlog cache are consecutively allocated,
-	 * we can usually gather multiple pages together and issue just one
-	 * write() call.  npages is the number of pages we have determined can be
-	 * written together; startidx is the cache block index of the first one,
-	 * and startoffset is the file offset at which it should go. The latter
-	 * two variables are only valid when npages > 0, but we must initialize
-	 * all of them to keep the compiler quiet.
-	 */
-	npages = 0;
-	startidx = 0;
-	startoffset = 0;
+	/* Fast return if not requested to flush */
+	if (WriteRqst.Flush == 0)
+		return;
+	Assert(WriteRqst.Flush == WriteRqst.Write);
 
 	/*
-	 * Within the loop, curridx is the cache block index of the page to
-	 * consider writing.  Begin at the buffer containing the next unwritten
-	 * page, or last partially written page.
+	 * Call pmem_persist() or pmem_msync() for each segment file that contains
+	 * records to be flushed.
 	 */
-	curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);
-
-	while (LogwrtResult.Write < WriteRqst.Write)
+	XLByteToPrevSeg(WriteRqst.Flush, rqstLogSegNo, wal_segment_size);
+	XLByteToSeg(LogwrtResult.Flush, segno, wal_segment_size);
+	while (segno <= rqstLogSegNo)
 	{
-		/*
-		 * Make sure we're not ahead of the insert process.  This could happen
-		 * if we're passed a bogus WriteRqst.Write that is past the end of the
-		 * last page that's been initialized by AdvanceXLInsertBuffer.
-		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		bool		is_pmem;
+		char	   *addr;
+		char	   *p;
+		Size		len;
+		XLogRecPtr	BeginPtr;
+		XLogRecPtr	EndPtr;
 
-		if (LogwrtResult.Write >= EndPtr)
-			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
-				 (uint32) (LogwrtResult.Write >> 32),
-				 (uint32) LogwrtResult.Write,
-				 (uint32) (EndPtr >> 32), (uint32) EndPtr);
-
-		/* Advance LogwrtResult.Write to end of current buffer page */
-		LogwrtResult.Write = EndPtr;
-		ispartialpage = WriteRqst.Write < LogwrtResult.Write;
-
-		if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
-							 wal_segment_size))
+		/* Check if the segment is not mapped yet */
+		if (segno != openLogSegNo)
 		{
+			/* Map newly */
+			is_pmem = 0;
+			addr = XLogFileMap(segno, &is_pmem);
+
 			/*
-			 * Switch to new logfile segment.  We cannot have any pending
-			 * pages here (since we dump what we have at segment end).
+			 * Use the mapped above as WAL buffer of this process for the
+			 * future.  Note that it might be unmapped within this loop.
 			 */
-			Assert(npages == 0);
-			if (openLogFile >= 0)
-				XLogFileClose();
-			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-							wal_segment_size);
-
-			/* create/use new log file */
-			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
-			ReserveExternalFD();
+			if (openLogSegNo == 0)
+			{
+				pmemMapped = is_pmem;
+				mappedPages = addr;
+				openLogSegNo = segno;
+			}
 		}
-
-		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		else
 		{
-			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
-			ReserveExternalFD();
+			/* Or use existent mapping */
+			is_pmem = pmemMapped;
+			addr = mappedPages;
 		}
+		Assert(addr != NULL);
+		Assert(mappedPages != NULL);
+		Assert(openLogSegNo > 0);
 
-		/* Add current page to the set of pending pages-to-dump */
-		if (npages == 0)
-		{
-			/* first of group */
-			startidx = curridx;
-			startoffset = XLogSegmentOffset(LogwrtResult.Write - XLOG_BLCKSZ,
-											wal_segment_size);
-		}
-		npages++;
+		/* Find beginning position to be flushed */
+		BeginPtr = segno * wal_segment_size;
+		if (BeginPtr < LogwrtResult.Flush)
+			BeginPtr = LogwrtResult.Flush;
+
+		/* Find ending position to be flushed */
+		EndPtr = (segno + 1) * wal_segment_size;
+		if (EndPtr > WriteRqst.Flush)
+			EndPtr = WriteRqst.Flush;
+
+		/* Convert LSN to memory address */
+		Assert(BeginPtr <= EndPtr);
+		p = addr + BeginPtr % wal_segment_size;
+		len = (Size) (EndPtr - BeginPtr);
 
 		/*
-		 * Dump the set if this will be the last loop iteration, or if we are
-		 * at the last page of the cache area (since the next page won't be
-		 * contiguous in memory), or if we are at the end of the logfile
-		 * segment.
+		 * Do cache-flush or msync.
+		 *
+		 * Note that pmem_msync() does backoff to the page boundary.
 		 */
-		last_iteration = WriteRqst.Write <= LogwrtResult.Write;
-
-		finishing_seg = !ispartialpage &&
-			(startoffset + npages * XLOG_BLCKSZ) >= wal_segment_size;
-
-		if (last_iteration ||
-			curridx == XLogCtl->XLogCacheBlck ||
-			finishing_seg)
+		if (is_pmem)
 		{
-			char	   *from;
-			Size		nbytes;
-			Size		nleft;
-			int			written;
-
-			/* OK to write the page(s) */
-			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
-			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
+			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+			pmem_persist(p, len);
+			pgstat_report_wait_end();
+		}
+		else
+		{
+			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
+			if (pmem_msync(p, len))
 			{
-				errno = 0;
-				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno;
+
 				pgstat_report_wait_end();
-				if (written <= 0)
-				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
 
-					if (errno == EINTR)
-						continue;
+				save_errno = errno;
+				XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+							 wal_segment_size);
+				errno = save_errno;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not msync to log file %s "
+								"at address %p, length %zu: %m",
+								xlogfname, p, len)));
+			}
+			pgstat_report_wait_end();
+		}
+		LogwrtResult.Flush = LogwrtResult.Write = EndPtr;
 
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+		/* Check if whole my WAL buffers are synchronized to the segment */
+		finishing_seg = (LogwrtResult.Flush % wal_segment_size == 0) &&
+						XLByteInPrevSeg(LogwrtResult.Flush, openLogSegNo,
+										wal_segment_size);
 
-			npages = 0;
-
-			/*
-			 * If we just wrote the whole last page of a logfile segment,
-			 * fsync the segment immediately.  This avoids having to go back
-			 * and re-open prior segments when an fsync request comes along
-			 * later. Doing it here ensures that one and only one backend will
-			 * perform this fsync.
-			 *
-			 * This is also the right place to notify the Archiver that the
-			 * segment is ready to copy to archival storage, and to update the
-			 * timer for archive_timeout, and to signal for a checkpoint if
-			 * too many logfile segments have been used since the last
-			 * checkpoint.
-			 */
+		if (segno != openLogSegNo || finishing_seg)
+		{
+			XLogFileUnmap(addr, segno);
 			if (finishing_seg)
 			{
-				issue_xlog_fsync(openLogFile, openLogSegNo);
-
-				/* signal that we need to wakeup walsenders later */
-				WalSndWakeupRequest();
-
-				LogwrtResult.Flush = LogwrtResult.Write;	/* end of page */
-
-				if (XLogArchivingActive())
-					XLogArchiveNotifySeg(openLogSegNo);
-
-				XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
-				XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
-
-				/*
-				 * Request a checkpoint if we've consumed too much xlog since
-				 * the last one.  For speed, we first check using the local
-				 * copy of RedoRecPtr, which might be out of date; if it looks
-				 * like a checkpoint is needed, forcibly update RedoRecPtr and
-				 * recheck.
-				 */
-				if (IsUnderPostmaster && XLogCheckpointNeeded(openLogSegNo))
-				{
-					(void) GetRedoRecPtr();
-					if (XLogCheckpointNeeded(openLogSegNo))
-						RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
-				}
+				Assert(segno == openLogSegNo);
+				mappedPages = NULL;
+				openLogSegNo = 0;
 			}
-		}
 
-		if (ispartialpage)
-		{
-			/* Only asked to write a partial page */
-			LogwrtResult.Write = WriteRqst.Write;
-			break;
-		}
-		curridx = NextBufIdx(curridx);
+			/* signal that we need to wakeup walsenders later */
+			WalSndWakeupRequest();
 
-		/* If flexible, break out of loop as soon as we wrote something */
-		if (flexible && npages == 0)
-			break;
-	}
+			if (XLogArchivingActive())
+				XLogArchiveNotifySeg(segno);
 
-	Assert(npages == 0);
+			XLogCtl->lastSegSwitchTime = (pg_time_t) time(NULL);
+			XLogCtl->lastSegSwitchLSN = LogwrtResult.Flush;
 
-	/*
-	 * If asked to flush, do so
-	 */
-	if (LogwrtResult.Flush < WriteRqst.Flush &&
-		LogwrtResult.Flush < LogwrtResult.Write)
-
-	{
-		/*
-		 * Could get here without iterating above loop, in which case we might
-		 * have no open file or the wrong one.  However, we do not need to
-		 * fsync more than one file.
-		 */
-		if (sync_method != SYNC_METHOD_OPEN &&
-			sync_method != SYNC_METHOD_OPEN_DSYNC)
-		{
-			if (openLogFile >= 0 &&
-				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
-								 wal_segment_size))
-				XLogFileClose();
-			if (openLogFile < 0)
+			/*
+			 * Request a checkpoint if we've consumed too much xlog since
+			 * the last one.  For speed, we first check using the local
+			 * copy of RedoRecPtr, which might be out of date; if it looks
+			 * like a checkpoint is needed, forcibly update RedoRecPtr and
+			 * recheck.
+			 */
+			if (IsUnderPostmaster && XLogCheckpointNeeded(segno))
 			{
-				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
-								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
-				ReserveExternalFD();
+				(void) GetRedoRecPtr();
+				if (XLogCheckpointNeeded(segno))
+					RequestCheckpoint(CHECKPOINT_CAUSE_XLOG);
 			}
-
-			issue_xlog_fsync(openLogFile, openLogSegNo);
 		}
 
-		/* signal that we need to wakeup walsenders later */
-		WalSndWakeupRequest();
-
-		LogwrtResult.Flush = LogwrtResult.Write;
+		++segno;
 	}
 
+	/* signal that we need to wakeup walsenders later */
+	WalSndWakeupRequest();
+
 	/*
 	 * Update shared-memory status
 	 *
@@ -3090,6 +2704,16 @@ XLogBackgroundFlush(void)
 				XLogFileClose();
 			}
 		}
+		else if (mappedPages != NULL)
+		{
+			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
+								 wal_segment_size))
+			{
+				XLogFileUnmap(mappedPages, openLogSegNo);
+				mappedPages = NULL;
+				openLogSegNo = 0;
+			}
+		}
 		return false;
 	}
 
@@ -3156,12 +2780,6 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests();
 
-	/*
-	 * Great, done. To take some work off the critical path, try to initialize
-	 * as many of the no-longer-needed WAL buffers for future use as we can.
-	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, true);
-
 	/*
 	 * If we determined that we need to write data, but somebody else
 	 * wrote/flushed already, it should be considered as being active, to
@@ -3315,9 +2933,26 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-	save_errno = 0;
-	if (wal_init_zero)
+
+	/*
+	 * Allocate the file by posix_fallocate(3) to utilize hugepage and reduce
+	 * overhead of page fault.  Note that posix_fallocate(3) does not set errno
+	 * on error.  Instead, it returns an error number directly.
+	 */
+	save_errno = posix_fallocate(fd, 0, wal_segment_size);
+
+	if (save_errno)
 	{
+		/*
+		 * Do nothing on error.  Go to pgstat_report_wait_end().
+		 */
+	}
+	else if (wal_init_zero)
+	{
+		XLogCtlInsert  *Insert = &XLogCtl->Insert;
+		XLogPageHeader	NewPage = (XLogPageHeader) zbuffer.data;
+		XLogRecPtr		NewPageBeginPtr = logsegno * wal_segment_size;
+
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
 		 * ensure that all the file space has really been allocated.  On
@@ -3329,6 +2964,48 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 */
 		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 		{
+			memset(NewPage, 0, SizeOfXLogLongPHD);
+
+			/*
+			 * Fill the new page's header
+			 */
+			NewPage->xlp_magic = XLOG_PAGE_MAGIC;
+
+			/* NewPage->xlp_info = 0; */	/* done by memset */
+			NewPage->xlp_tli = ThisTimeLineID;
+			NewPage->xlp_pageaddr = NewPageBeginPtr;
+
+			/* NewPage->xlp_rem_len = 0; */	/* done by memset */
+
+			/*
+			 * If online backup is not in progress, mark the header to indicate
+			 * that WAL records beginning in this page have removable backup
+			 * blocks.  This allows the WAL archiver to know whether it is safe to
+			 * compress archived WAL data by transforming full-block records into
+			 * the non-full-block format.  It is sufficient to record this at the
+			 * page level because we force a page switch (in fact a segment
+			 * switch) when starting a backup, so the flag will be off before any
+			 * records can be written during the backup.  At the end of a backup,
+			 * the last page will be marked as all unsafe when perhaps only part
+			 * is unsafe, but at worst the archiver would miss the opportunity to
+			 * compress a few records.
+			 */
+			if (!Insert->forcePageWrites)
+				NewPage->xlp_info |= XLP_BKP_REMOVABLE;
+
+			/*
+			 * If first page of an XLOG segment file, make it a long header.
+			 */
+			if ((XLogSegmentOffset(NewPage->xlp_pageaddr, wal_segment_size)) == 0)
+			{
+				XLogLongPageHeader NewLongPage = (XLogLongPageHeader) NewPage;
+
+				NewLongPage->xlp_sysid = ControlFile->system_identifier;
+				NewLongPage->xlp_seg_size = wal_segment_size;
+				NewLongPage->xlp_xlog_blcksz = XLOG_BLCKSZ;
+				NewPage->xlp_info |= XLP_LONG_HEADER;
+			}
+
 			errno = 0;
 			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
 			{
@@ -3336,6 +3013,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 				save_errno = errno ? errno : ENOSPC;
 				break;
 			}
+
+			NewPageBeginPtr += XLOG_BLCKSZ;
 		}
 	}
 	else
@@ -3651,6 +3330,138 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	return true;
 }
 
+/*
+ * Get a hint address for hugepage boundary mapping.
+ *
+ * Returns non-NULL if success, or PANICs otherwise.
+ */
+static void *
+XLogFileMapHint(void)
+{
+	void	   *hint;
+	Size		len;
+
+	len = (Size) wal_segment_size + PG_HUGEPAGE_MASK + 1;
+	hint = mmap(NULL, len, PROT_READ | PROT_WRITE,
+				MAP_SHARED | MAP_ANONYMOUS, -1, 0);
+
+	if (hint == MAP_FAILED)
+		elog(PANIC, "could not get hint address");
+
+	if (munmap(hint, len) != 0)
+		elog(PANIC, "could not unmap hint address");
+
+	/* Go forward onto the nearest hugepage boundary */
+	return (void *) (((uintptr_t) hint + PG_HUGEPAGE_MASK) & ~PG_HUGEPAGE_MASK);
+}
+
+static void *
+XLogFileMapUtil(void *hint, int fd, bool dax)
+{
+	int			flags;
+
+	if (dax)
+		flags = MAP_SHARED_VALIDATE | MAP_SYNC;
+	else
+		flags = MAP_SHARED;
+
+	return mmap(hint, wal_segment_size, PROT_READ | PROT_WRITE, flags, fd, 0);
+}
+
+/*
+ * Memory-map a pre-existing logfile segment for WAL buffers.
+ *
+ * If success, it returns non-NULL and is_pmem is set whether the file is on
+ * PMEM or not.  Otherwise, it PANICs.
+ */
+static char *
+XLogFileMap(XLogSegNo segno, bool *is_pmem)
+{
+	char		path[MAXPGPATH];
+	char	   *addr;
+	void	   *hint;
+	int			fd;
+	struct stat	stat_buf;
+
+	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
+
+	fd = BasicOpenFile(path, O_RDWR | PG_BINARY);
+	if (fd < 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not open file \"%s\": %m", path)));
+
+	if (fstat(fd, &stat_buf) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not fstat file \"%s\": %m", path)));
+
+	if (stat_buf.st_size != wal_segment_size)
+		elog(PANIC,
+			 "invalid logfile segment size; path \"%s\" actual %d expected %d",
+			 path, (int) stat_buf.st_size, wal_segment_size);
+
+	hint = XLogFileMapHint();
+
+	/*
+	 * Try DAX mapping first (dax=true).
+	 *
+	 * If not supported, then do regular mapping (dax=false).
+	 */
+	addr = XLogFileMapUtil(hint, fd, true);
+
+	if (addr != MAP_FAILED)
+	{
+		*is_pmem = true;
+	}
+	else if (errno == EOPNOTSUPP || errno == EINVAL)
+	{
+		addr = XLogFileMapUtil(hint, fd, false);
+
+		if (addr == MAP_FAILED)
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not mmap file \"%s\": %m", path)));
+
+		*is_pmem = false;
+	}
+
+	/* Check if the logfile segment is mapped onto hugepage boundary */
+	if ((uintptr_t) addr & PG_HUGEPAGE_MASK)
+		elog(WARNING,
+			 "logfile segment is not mapped onto hugepage boundary; path \"%s\" actual %p expected %p",
+			 path, addr, hint);
+
+	/* We don't need the file descriptor anymore, so close it */
+	if (close(fd) != 0)
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not close file \"%s\": %m", path)));
+
+	return addr;
+}
+
+/*
+ * Unmap a given logfile segment for WAL buffer.
+ */
+static void
+XLogFileUnmap(char *pages, XLogSegNo segno)
+{
+	Assert(pages != NULL);
+
+	if (munmap(pages, wal_segment_size) != 0)
+	{
+		char		xlogfname[MAXFNAMELEN];
+		int			save_errno = errno;
+
+		XLogFileName(xlogfname, ThisTimeLineID, segno, wal_segment_size);
+		errno = save_errno;
+		ereport(PANIC,
+				(errcode_for_file_access(),
+				 errmsg("could not unmap file \"%s\": %m", xlogfname)));
+	}
+}
+
 /*
  * Open a pre-existing logfile segment for writing.
  */
@@ -5070,12 +4881,6 @@ XLOGShmemSize(void)
 
 	/* WAL insertion locks, plus alignment */
 	size = add_size(size, mul_size(sizeof(WALInsertLockPadded), NUM_XLOGINSERT_LOCKS + 1));
-	/* xlblocks array */
-	size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
-	/* extra alignment padding for XLOG I/O buffers */
-	size = add_size(size, XLOG_BLCKSZ);
-	/* and the buffers themselves */
-	size = add_size(size, mul_size(XLOG_BLCKSZ, XLOGbuffers));
 
 	/*
 	 * Note: we don't count ControlFileData, it comes out of the "slop factor"
@@ -5149,10 +4954,6 @@ XLOGShmemInit(void)
 	 * needed here.
 	 */
 	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
-	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
-	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
-	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
-
 
 	/* WAL insertion locks. Ensure they're aligned to the full padded size */
 	allocptr += sizeof(WALInsertLockPadded) -
@@ -5168,15 +4969,6 @@ XLOGShmemInit(void)
 		WALInsertLocks[i].l.lastImportantAt = InvalidXLogRecPtr;
 	}
 
-	/*
-	 * Align the start of the page buffers to a full xlog block size boundary.
-	 * This simplifies some calculations in XLOG insertion. It is also
-	 * required for O_DIRECT.
-	 */
-	allocptr = (char *) TYPEALIGN(XLOG_BLCKSZ, allocptr);
-	XLogCtl->pages = allocptr;
-	memset(XLogCtl->pages, 0, (Size) XLOG_BLCKSZ * XLOGbuffers);
-
 	/*
 	 * Do basic initialization of XLogCtl shared data. (StartupXLOG will fill
 	 * in additional info.)
@@ -7717,40 +7509,12 @@ StartupXLOG(void)
 	Insert->CurrBytePos = XLogRecPtrToBytePos(EndOfLog);
 
 	/*
-	 * Tricky point here: readBuf contains the *last* block that the LastRec
-	 * record spans, not the one it starts in.  The last block is indeed the
-	 * one we want to use.
+	 * We DO NOT need the if-else block once existed here because we use WAL
+	 * segment files as WAL buffers so the last block is "already on the
+	 * buffers."
+	 *
+	 * XXX We assume there is no torn record.
 	 */
-	if (EndOfLog % XLOG_BLCKSZ != 0)
-	{
-		char	   *page;
-		int			len;
-		int			firstIdx;
-		XLogRecPtr	pageBeginPtr;
-
-		pageBeginPtr = EndOfLog - (EndOfLog % XLOG_BLCKSZ);
-		Assert(readOff == XLogSegmentOffset(pageBeginPtr, wal_segment_size));
-
-		firstIdx = XLogRecPtrToBufIdx(EndOfLog);
-
-		/* Copy the valid part of the last block, and zero the rest */
-		page = &XLogCtl->pages[firstIdx * XLOG_BLCKSZ];
-		len = EndOfLog % XLOG_BLCKSZ;
-		memcpy(page, xlogreader->readBuf, len);
-		memset(page + len, 0, XLOG_BLCKSZ - len);
-
-		XLogCtl->xlblocks[firstIdx] = pageBeginPtr + XLOG_BLCKSZ;
-		XLogCtl->InitializedUpTo = pageBeginPtr + XLOG_BLCKSZ;
-	}
-	else
-	{
-		/*
-		 * There is no partial block to copy. Just set InitializedUpTo, and
-		 * let the first attempt to insert a log record to initialize the next
-		 * buffer.
-		 */
-		XLogCtl->InitializedUpTo = EndOfLog;
-	}
 
 	LogwrtResult.Write = LogwrtResult.Flush = EndOfLog;
 
-- 
2.25.1
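
The per-segment flush range computed in the XLogWrite() hunk above (BeginPtr, EndPtr, then `p` and `len`) can be looked at in isolation. The following is only a minimal sketch of that arithmetic, not code from the patch: the names `Lsn`, `SegNo`, `FlushRange`, and `segment_flush_range` are hypothetical stand-ins, and the real code works on XLogRecPtr, XLogSegNo, and the mapped address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for PostgreSQL's XLogRecPtr and XLogSegNo. */
typedef uint64_t Lsn;
typedef uint64_t SegNo;

typedef struct FlushRange
{
	size_t		offset;		/* byte offset from the start of the mapped segment */
	size_t		len;		/* number of bytes to persist within this segment */
} FlushRange;

/*
 * Clamp the LSN interval [flushed, request) to the segment "segno" and
 * convert it into an offset/length pair relative to the segment start.
 */
static FlushRange
segment_flush_range(SegNo segno, Lsn flushed, Lsn request, uint64_t seg_size)
{
	Lsn			begin = segno * seg_size;
	Lsn			end = (segno + 1) * seg_size;
	FlushRange	r;

	/* Do not go back before what is already flushed */
	if (begin < flushed)
		begin = flushed;

	/* Do not go beyond what was requested */
	if (end > request)
		end = request;

	assert(begin <= end);

	r.offset = (size_t) (begin % seg_size);
	r.len = (size_t) (end - begin);
	return r;
}
```

With 16MB segments, a request that starts 100 bytes into segment 3 and ends 50 bytes into segment 4 yields two ranges: the tail of segment 3, then the first 50 bytes of segment 4. Each range would then be handed to pmem_persist() or pmem_msync() as in the patch.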

v3-0006-Map-WAL-segments-with-MAP_POPULATE-if-non-DAX.patchapplication/octet-stream; name=v3-0006-Map-WAL-segments-with-MAP_POPULATE-if-non-DAX.patchDownload
From 0870279832aac3ef802b6c4ae5ab348e6c854a9f Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Mon, 16 Mar 2020 11:14:04 +0900
Subject: [PATCH v3 06/10] Map WAL segments with MAP_POPULATE if non-DAX

---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 825de800b7..b7d99cacba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3406,7 +3406,7 @@ XLogFileMapUtil(void *hint, int fd, bool dax)
 	if (dax)
 		flags = MAP_SHARED_VALIDATE | MAP_SYNC;
 	else
-		flags = MAP_SHARED;
+		flags = MAP_SHARED | MAP_POPULATE;
 
 	return mmap(hint, wal_segment_size, PROT_READ | PROT_WRITE, flags, fd, 0);
 }
-- 
2.25.1

v3-0008-Create-a-new-WAL-segment-just-before-mapping.patchapplication/octet-stream; name=v3-0008-Create-a-new-WAL-segment-just-before-mapping.patchDownload
From 9cc69017ed6028cfaf21c4709f4527c7d206ea2f Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Wed, 25 Mar 2020 11:19:05 +0900
Subject: [PATCH v3 08/10] Create a new WAL segment just before mapping

---
 src/backend/access/transam/xlog.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 777a9e921c..5a6304176b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3430,9 +3430,20 @@ XLogFileMap(XLogSegNo segno, bool *is_pmem)
 
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY);
 	if (fd < 0)
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not open file \"%s\": %m", path)));
+	{
+		bool		use_existent = true;
+
+		/*
+		 * Create a new logfile segment if not exists.  This is an exceptional
+		 * creation because it should be done at the end of checkpoint.  So we
+		 * log this as warning.
+		 */
+		elog(WARNING,
+			 "creating logfile segment just before mapping; path \"%s\"",
+			 path);
+
+		fd = XLogFileInit(segno, &use_existent, true);
+	}
 
 	if (fstat(fd, &stat_buf) != 0)
 		ereport(PANIC,
-- 
2.25.1

v3-0010-Revert-Speculative-map-WAL-segments.patchapplication/octet-stream; name=v3-0010-Revert-Speculative-map-WAL-segments.patchDownload
From 93d59367dd30ad5b9b8b5eebb38ec85969418190 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 8 Dec 2020 17:25:10 +0900
Subject: [PATCH v3 10/10] Revert "Speculative-map WAL segments"

This reverts commit ca00a5c6faca758faa9e386cb980eb8ead9900db.
---
 src/backend/access/transam/xlog.c | 19 -------------------
 1 file changed, 19 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b4fa70aa2f..f9a3716006 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1029,8 +1029,6 @@ XLogInsertRecord(XLogRecData *rdata,
 							   info == XLOG_SWITCH);
 	XLogRecPtr	StartPos;
 	XLogRecPtr	EndPos;
-	XLogRecPtr	ProbablyInsertPos;
-	XLogSegNo	ProbablyInsertSegNo;
 	bool		prevDoPageWrites = doPageWrites;
 
 	/* we assume that all of the record header is in the first chunk */
@@ -1040,23 +1038,6 @@ XLogInsertRecord(XLogRecData *rdata,
 	if (!XLogInsertAllowed())
 		elog(ERROR, "cannot make new WAL entries during recovery");
 
-	/* Speculatively map a segment we probably need */
-	ProbablyInsertPos = GetInsertRecPtr();
-	XLByteToSeg(ProbablyInsertPos, ProbablyInsertSegNo, wal_segment_size);
-	if (ProbablyInsertSegNo != openLogSegNo)
-	{
-		if (mappedPages != NULL)
-		{
-			Assert(beingUnmappedPages == NULL);
-			Assert(beingClosedLogSegNo == 0);
-			beingUnmappedPages = mappedPages;
-			beingClosedLogSegNo = openLogSegNo;
-		}
-		mappedPages = XLogFileMap(ProbablyInsertSegNo, &pmemMapped);
-		Assert(mappedPages != NULL);
-		openLogSegNo = ProbablyInsertSegNo;
-	}
-
 	/*----------
 	 *
 	 * We have now done all the preparatory work we can without holding a
-- 
2.25.1

v3-0009-Do-not-open-an-existing-WAL-segment-when-creating.patchapplication/octet-stream; name=v3-0009-Do-not-open-an-existing-WAL-segment-when-creating.patchDownload
From 733de7d4b44df452ca8a584715991b0902468a3f Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 8 Dec 2020 16:28:55 +0900
Subject: [PATCH v3 09/10] Do not open an existing WAL segment when creating
 just before mapping

This commit fixes the commit "Create a new WAL segment just before
mapping" (598c5e6768cb90118741d5350da24eaf9a340add).
---
 src/backend/access/transam/xlog.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5a6304176b..b4fa70aa2f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -3431,7 +3431,7 @@ XLogFileMap(XLogSegNo segno, bool *is_pmem)
 	fd = BasicOpenFile(path, O_RDWR | PG_BINARY);
 	if (fd < 0)
 	{
-		bool		use_existent = true;
+		bool		use_existent = false;
 
 		/*
 		 * Create a new logfile segment if not exists.  This is an exceptional
-- 
2.25.1

#51Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#50)
Re: [PoC] Non-volatile WAL buffer

Dear everyone, Tomas,

First of all, the "v4" patchset for non-volatile WAL buffer attached to the
previous mail is actually v5... Please read "v4" as "v5."

Then, to Tomas:
Thank you for the crash report you sent on Nov 27, 2020, regarding the
msync patchset. I applied the latest msync patchset v3, attached to the
previous mail, to master 411ae64 (as of Jan 18, 2021) and tested it; I got
no error when running pgbench -i -s 500. Please try it if necessary.

Best regards,
Takashi

On Tue, Jan 26, 2021 at 5:52 PM Takashi Menjo <takashi.menjo@gmail.com> wrote:

Dear everyone,

Sorry but I forgot to attach my patchsets... Please see the files attached
to this mail. Please also note that they contain some fixes.

Best regards,
Takashi

On Tue, Jan 26, 2021 at 5:46 PM Takashi Menjo <takashi.menjo@gmail.com> wrote:

Dear everyone,

I'm sorry for the late reply. I have rebased my two patchsets onto the
latest master 411ae64. The patchset prefixed with v4 is for the
non-volatile WAL buffer; the other, prefixed with v3, is for msync.

I will reply to your valuable feedback one by one within a few days.
Please wait a moment.

Best regards,
Takashi

On Mon, Jan 25, 2021 at 11:56 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Fri, Jan 22, 2021 at 11:32 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

On Thu, Jan 7, 2021 at 2:16 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

Hi,

I think I've managed to get the 0002 patch [1] rebased to master and
working (with help from Masahiko Sawada). It's not clear to me how it
could have worked as submitted - my theory is that an incomplete patch
was submitted by mistake, or something like that.

Unfortunately, the benchmark results were kinda disappointing. For a
pgbench on scale 500 (fits into shared buffers), an average of three
5-minute runs looks like this:

  branch                  1       16       32       64       96
  ----------------------------------------------------------------
  master               7291    87704   165310   150437   224186
  ntt                  7912   106095   213206   212410   237819
  simple-no-buffers    7654    96544   115416    95828   103065

NTT refers to the patch from September 10, pre-allocating a large WAL
file on PMEM, and simple-no-buffers is the simpler patch simply removing
the WAL buffers and writing directly to a mmap-ed WAL segment on PMEM.

Note: The patch is just replacing the old implementation with mmap.
That's good enough for experiments like this, but we probably want to
keep the old one for setups without PMEM. But it's good enough for
testing, benchmarking etc.

Unfortunately, the results for this simple approach are pretty bad. Not
only compared to the "ntt" patch, but even to master. I'm not entirely
sure what's the root cause, but I have a couple hypotheses:

1) bug in the patch - That's clearly a possibility, although I've tried
to eliminate this possibility.

2) PMEM is slower than DRAM - From what I know, PMEM is much faster than
NVMe storage, but still much slower than DRAM (both in terms of latency
and bandwidth, see [2] for some data). It's not terrible, but the
latency is maybe 2-3x higher - not a huge difference, but may matter for
WAL buffers?

3) PMEM does not handle parallel writes well - If you look at [2],
Figure 4(b), you'll see that the throughput actually *drops* as the
number of threads increases. That's pretty strange / annoying, because
that's how we write into WAL buffers - each thread writes its own data,
so parallelism is not something we can get rid of.

I've added some simple profiling, to measure number of calls / time for
each operation (use -DXLOG_DEBUG_STATS to enable). It accumulates data
for each backend, and logs the counts every 1M ops.

Typical stats from a concurrent run look like this:

xlog stats cnt 43000000
map cnt 100 time 5448333 unmap cnt 100 time 3730963
memcpy cnt 985964 time 1550442272 len 15150499
memset cnt 0 time 0 len 0
persist cnt 13836 time 10369617 len 16292182

The times are in nanoseconds, so this says the backend did 100 mmap and
unmap calls, taking ~10ms in total. There were ~14k pmem_persist calls,
taking 10ms in total. And the most time (~1.5s) was used by pmem_memcpy
copying about 15MB of data. That's quite a lot :-(
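To make the counters above easier to compare, here is a minimal sketch that turns them into a total time and a per-call average for each operation. The numbers are copied from the stats output; treating "len" as a byte count is an assumption, and the helper name summarize is made up for this illustration. By this arithmetic the memcpy calls average roughly 1.6 microseconds each, while map plus unmap add up to about 9 ms.

```python
# Counters copied from the "xlog stats" output quoted above.
# Times are in nanoseconds; "len" is ASSUMED to be a byte count.
STATS = {
    "map":     {"cnt": 100,    "time_ns": 5448333,    "len": 0},
    "unmap":   {"cnt": 100,    "time_ns": 3730963,    "len": 0},
    "memcpy":  {"cnt": 985964, "time_ns": 1550442272, "len": 15150499},
    "persist": {"cnt": 13836,  "time_ns": 10369617,   "len": 16292182},
}


def summarize(stats):
    """Return total time (ms), mean time per call (ns) and mean len per call."""
    out = {}
    for name, s in stats.items():
        calls = s["cnt"]
        out[name] = {
            "total_ms": s["time_ns"] / 1e6,
            "ns_per_call": s["time_ns"] / calls if calls else 0.0,
            "len_per_call": s["len"] / calls if calls else 0.0,
        }
    return out


summary = summarize(STATS)
for name, row in summary.items():
    print(f"{name:8s} total {row['total_ms']:10.2f} ms  "
          f"{row['ns_per_call']:9.1f} ns/call  "
          f"{row['len_per_call']:8.1f} len/call")
```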

It might also be interesting if we can see how much time is spent on each
logging function, such as XLogInsert(), XLogWrite(), and XLogFlush().

Yeah, we could extend it to that, that's a fairly mechanical thing. But
maybe that could be visible in a regular perf profile. Also, I suppose
most of the time will be used by the pmem calls, shown in the stats.

My conclusion from this is that eliminating WAL buffers and writing WAL
directly to PMEM (by memcpy to mmap-ed WAL segments) is probably not the
right approach.

I suppose we should keep WAL buffers, and then just write the data to
mmap-ed WAL segments on PMEM. Which I think is what the NTT patch does,
except that it allocates one huge file on PMEM and writes to that
(instead of the traditional WAL segments).

So I decided to try how it'd work with writing to regular WAL segments,
mmap-ed ad hoc. The pmem-with-wal-buffers-master.patch patch does that,
and the results look a bit nicer:

branch                   1      16      32      64      96
----------------------------------------------------------------
master                7291   87704  165310  150437  224186
ntt                   7912  106095  213206  212410  237819
simple-no-buffers     7654   96544  115416   95828  103065
with-wal-buffers      7477   95454  181702  140167  214715

So, much better than the version without WAL buffers, somewhat better
than master (except for 64/96 clients), but still not as good as NTT.

At this point I was wondering how could the NTT patch be faster when
it's doing roughly the same thing. I'm sure there are some differences,
but it seemed strange. The main difference seems to be that it only maps
one large file, and only once. OTOH the alternative "simple" patch maps
segments one by one, in each backend. Per the debug stats the map/unmap
calls are fairly cheap, but maybe it interferes with the memcpy somehow.

While looking at the two methods: NTT and simple-no-buffer, I realized
that in XLogFlush(), NTT patch flushes (by pmem_flush() and
pmem_drain()) WAL without acquiring WALWriteLock whereas
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
differences between those two methods since WALWriteLock serializes
the operations. With PMEM, multiple backends can concurrently flush
the records if the memory region is not overlapped? If so, flushing
WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

Thanks, that's good to know.

TBH I'm not convinced the "simple-no-buffer" code (coming from the 0002
patch) is actually correct. Essentially, consider the case where the
backend needs to do a flush, but does not have a segment mapped. So it
maps it and calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

Yeah, in terms of experiments at least it's good to find out that the
approach of mmap-ing each WAL segment does not perform well.

So I did an experiment by increasing the size of the WAL segments. I
chose to try with 512MB and 1024MB, and the results with 1GB look
like this:

branch                   1      16      32      64      96
----------------------------------------------------------------
master                6635   88524  171106  163387  245307
ntt                   7909  106826  217364  223338  242042
simple-no-buffers     7871  101575  199403  188074  224716
with-wal-buffers      7643  101056  206911  223860  261712

So yeah, there's a clear difference. It changes the values for "master"
a bit, but both the "simple" patches (with and without WAL buffers) are
much faster. The with-wal-buffers is almost equal to the NTT patch,
which was using a 96GB file. I presume larger WAL segments would get even
closer, if we supported them.
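As a rough check of the "almost equal to the NTT patch" observation, the sketch below divides the with-wal-buffers figures by the ntt figures for both segment sizes. The numbers are copied from the two tables in this message; the helper name relative and the dictionary layout are only illustrative.

```python
CLIENTS = [1, 16, 32, 64, 96]

# Throughput figures copied from the 16MB-segment and 1GB-segment tables.
TPS_16MB = {
    "ntt":              [7912, 106095, 213206, 212410, 237819],
    "with-wal-buffers": [7477, 95454, 181702, 140167, 214715],
}
TPS_1GB = {
    "ntt":              [7909, 106826, 217364, 223338, 242042],
    "with-wal-buffers": [7643, 101056, 206911, 223860, 261712],
}


def relative(results, branch, baseline):
    """Per-client-count ratio of `branch` throughput to `baseline` throughput."""
    return [b / a for a, b in zip(results[baseline], results[branch])]


ratio_16mb = relative(TPS_16MB, "with-wal-buffers", "ntt")
ratio_1gb = relative(TPS_1GB, "with-wal-buffers", "ntt")

for n, small, large in zip(CLIENTS, ratio_16mb, ratio_1gb):
    print(f"{n:3d} clients: 16MB {small:.2f}x of ntt, 1GB {large:.2f}x of ntt")
```

With 16MB segments the ratio dips to about two thirds at 64 clients, while with 1GB segments it stays within a few percent of NTT and exceeds it at 64 and 96 clients.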

I'll continue investigating this, but my conclusion so far seems to be
that we can't really replace WAL buffers with PMEM - that seems to
perform much worse.

The question is what to do about the segment size. Can we reduce the
overhead of mmap-ing individual segments, so that this works even for
smaller WAL segments, to make this useful for common instances (not
everyone wants to run with 1GB WAL)? Or do we need to adopt the
design with a large file, mapped just once?

Another question is whether it's even worth the extra complexity. On
16MB segments the difference between master and NTT patch seems to be
non-trivial, but increasing the WAL segment size kinda reduces that. So
maybe just using file I/O on a PMEM DAX filesystem is good enough.
Alternatively, maybe we could switch to libpmemblk, which should
eliminate the filesystem overhead at least.

I think the performance improvement by NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

Also, I'm interested in why the throughput of NTT patch saturated at
32 clients, which is earlier than the master's one (96 clients). How
many CPU cores are there on the machine you used?

From what I know, this is somewhat expected for PMEM devices, for a
bunch of reasons:

1) The memory bandwidth is much lower than for DRAM (maybe ~10-20%), so
it takes fewer processes to saturate it.

2) Internally, the PMEM has a 256B buffer for writes, used for combining
etc. With too many processes sending writes, the access pattern starts to
look more random, which is harmful for throughput.

When combined, this means the performance starts dropping at a certain
number of threads, and the optimal number of threads is rather low
(something like 5-10). This is very different behavior compared to DRAM.

Makes sense.

There's a nice overview and measurements in this paper:

Building blocks for persistent memory / How to get the most out of your
new memory?
Alexander van Renen, Lukas Vogel, Viktor Leis, Thomas Neumann & Alfons
Kemper

https://link.springer.com/article/10.1007/s00778-020-00622-9

Thank you. I'll read it.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) write, not other abilities such as
fine-grained access and low-latency random write. If we want to
exploit all of its abilities we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers
with writes directly to PMEM is not economical, and aggregating data in
a DRAM buffer is better :-(

Yes. I think it might be interesting to do an analysis of the
bottlenecks of NTT patch by perf etc. If bottlenecks are moved to
other places by removing WALWriteLock during flush, it's probably a
good sign for further performance improvements. IIRC WALWriteLock is
one of the main bottlenecks on OLTP workload, although my memory might
already be out of date.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

--
Takashi Menjo <takashi.menjo@gmail.com>


#52Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#51)
Re: [PoC] Non-volatile WAL buffer

Hi,

Now I have caught up with this thread. I see that many of you are
interested in performance profiling.

I share my slides from SNIA SDC 2020 [1]. In the slides, I had profiles
focused on XLogInsert and XLogFlush (mainly the latter) for my non-volatile
WAL buffer patchset. I found that the time for XLogWrite and
locking/unlocking WALWriteLock was eliminated by the patchset. Instead,
XLogInsert and WaitXLogInsertionsToFinish took more (or a little more) time
than before because memcpy-ing to PMEM (Optane PMem) is slower than to DRAM.
For details, please see the slides.

Best regards,
Takashi

[1]: https://www.snia.org/educational-library/how-can-persistent-memory-make-databases-faster-and-how-could-we-go-ahead-2020

On Tue, Jan 26, 2021 at 18:50 Takashi Menjo <takashi.menjo@gmail.com> wrote:

Dear everyone, Tomas,

First of all, the "v4" patchset for non-volatile WAL buffer attached to
the previous mail is actually v5... Please read "v4" as "v5."

Then, to Tomas:
Thank you for the crash report you gave on Nov 27, 2020, regarding the msync
patchset. I applied the latest msync patchset v3, attached to the previous
mail, to master 411ae64 (on Jan 18, 2021) and tested it, and I got no error
from pgbench -i -s 500. Please try it if necessary.

Best regards,
Takashi

On Tue, Jan 26, 2021 at 17:52 Takashi Menjo <takashi.menjo@gmail.com> wrote:

Dear everyone,

Sorry but I forgot to attach my patchsets... Please see the files
attached to this mail. Please also note that they contain some fixes.

Best regards,
Takashi

On Tue, Jan 26, 2021 at 17:46 Takashi Menjo <takashi.menjo@gmail.com> wrote:

Dear everyone,

I'm sorry for the late reply. I rebased my two patchsets onto the latest
master 411ae64. The patchset prefixed with v4 is for the non-volatile WAL
buffer; the other, prefixed with v3, is for msync.

I will reply to your feedback, which I appreciate, one by one within a few
days. Please wait for a moment.

Best regards,
Takashi


#53Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Masahiko Sawada (#48)
Re: [PoC] Non-volatile WAL buffer

On 1/25/21 3:56 AM, Masahiko Sawada wrote:

...

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

...

While looking at the two methods: NTT and simple-no-buffer, I realized
that in XLogFlush(), NTT patch flushes (by pmem_flush() and
pmem_drain()) WAL without acquiring WALWriteLock whereas
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
differences between those two methods since WALWriteLock serializes
the operations. With PMEM, multiple backends can concurrently flush
the records if the memory region is not overlapped? If so, flushing
WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

Thanks, that's good to know.

TBH I'm not convinced the "simple-no-buffer" code (coming from the 0002
patch) is actually correct. Essentially, consider the case where the
backend needs to do a flush, but does not have a segment mapped. So it
maps it and calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

For the record, from what I learned / been told by engineers with PMEM
experience, calling pmem_drain() should properly flush changes done by
other processes. So it should be sufficient to do that in XLogFlush(),
from a single process.

My understanding is that we have about three challenges here:

(a) we still need to track how far we flushed, so this needs to be
protected by some lock anyway (although perhaps a much smaller section
of the function)

(b) pmem_drain() flushes all the changes, so it flushes even the "future"
part of the WAL after the requested LSN, which may negatively affect
performance I guess. So I wonder if pmem_persist would be a better fit,
as it allows specifying a range that should be persisted.

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with a relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like we have for insertions, i.e. a handful of locks
allowing a limited number of concurrent inserts.
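Points (a) and (b) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not how XLogFlush() or either patch works: the class FlushTracker and the persist_range callback are hypothetical names, with the callback standing in for a ranged flush such as pmem_persist(). The lock protects only the "flushed up to" position, and the expensive persist step runs outside it on just the missing range.

```python
import threading


class FlushTracker:
    """Toy model: a short critical section guards only the flushed position,
    and each caller persists just the range it still needs."""

    def __init__(self, persist_range):
        # persist_range(start, end) is a HYPOTHETICAL stand-in for a ranged
        # flush of the bytes in [start, end), e.g. via pmem_persist().
        self._persist_range = persist_range
        self._lock = threading.Lock()
        self._flushed_upto = 0

    def flush(self, upto):
        """Make everything below `upto` durable; return True if work was done."""
        with self._lock:                    # short section: read the position
            start = self._flushed_upto
        if upto <= start:
            return False                    # already durable, nothing to do
        self._persist_range(start, upto)    # expensive part, outside the lock
        with self._lock:                    # short section: publish progress
            self._flushed_upto = max(self._flushed_upto, upto)
        return True

    @property
    def flushed_upto(self):
        with self._lock:
            return self._flushed_upto


persisted = []
tracker = FlushTracker(lambda start, end: persisted.append((start, end)))
tracker.flush(100)   # persists [0, 100)
tracker.flush(60)    # no-op: already covered
tracker.flush(250)   # persists only [100, 250)
print(persisted, tracker.flushed_upto)
```

Under concurrency two callers may persist overlapping ranges, which is redundant but safe in this model because the position only advances after the caller's own range has been persisted. Limiting how many callers persist at once, as in point (c), is left out.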

Yeah, in terms of experiments at least it's good to find out that the
approach of mmap-ing each WAL segment does not perform well.

Right. The problem with small WAL segments seems to be that each mmap
causes the TLB to be thrown away, which means a lot of expensive cache
misses. As the mmap needs to be done by each backend writing WAL, this
is particularly bad with small WAL segments. The NTT patch works around
that by doing just a single mmap.

I wonder if we could pre-allocate and mmap small segments, and keep them
mapped and just rename the underlying files when recycling them. That'd
keep the regular segment files, as expected by various tools, etc. The
question is what would happen when we temporarily need more WAL, etc.
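The "keep segments mapped and rename the files" idea relies on a mapping surviving a rename of the underlying file. A minimal sketch of that behavior follows, assuming a POSIX system and using Python's standard mmap module on an ordinary temporary file rather than libpmem on a DAX filesystem. The helper names, the 1 MiB size and the segment-like file names are illustrative only.

```python
import mmap
import os
import tempfile

SEGMENT_SIZE = 1024 * 1024  # 1 MiB for the demo, not a real segment size


def create_mapped_segment(path, size=SEGMENT_SIZE):
    """Pre-allocate a file of `size` bytes and return (fd, mapping)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_EXCL, 0o600)
    os.ftruncate(fd, size)
    return fd, mmap.mmap(fd, size)  # shared read-write mapping by default


def recycle_segment(old_path, new_path, mapping):
    """Rename the file under a live mapping, then zero it for reuse."""
    os.rename(old_path, new_path)
    mapping[:] = bytes(len(mapping))
    mapping.flush()  # msync here; a PMEM-aware version would flush differently


workdir = tempfile.mkdtemp()
seg_a = os.path.join(workdir, "000000010000000000000001")
seg_b = os.path.join(workdir, "000000010000000000000002")

fd, seg = create_mapped_segment(seg_a)
seg[0:5] = b"first"
seg.flush()

recycle_segment(seg_a, seg_b, seg)  # no munmap/mmap pair needed
seg[0:6] = b"second"
seg.flush()

with open(seg_b, "rb") as f:
    recycled_head = f.read(6)

seg.close()
os.close(fd)
print(recycled_head)
```

The sketch says nothing about the open question in the paragraph above, namely what to do when more WAL is temporarily needed than there are pre-mapped segments.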

...

I think the performance improvement by NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

But we have benchmarked that, see my message from 2020/11/26, which
shows this table:

clients  master/btt  master/dax         ntt      simple
-----------------------------------------------------------
      1        5469        7402        7977        6746
     16       48222       80869      107025       82343
     32       73974      158189      214718      158348
     64       85921      154540      225715      164248
     96      150602      221159      237008      217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at
filesystem/kernel level, I haven't tried that.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) write, not other abilities such as
fine-grained access and low-latency random write. If we want to
exploit all of its abilities we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM
for this purpose. It may turn out that replacing the DRAM WAL buffers
with writes directly to PMEM is not economical, and aggregating data in
a DRAM buffer is better :-(

Yes. I think it might be interesting to do an analysis of the
bottlenecks of NTT patch by perf etc. If bottlenecks are moved to
other places by removing WALWriteLock during flush, it's probably a
good sign for further performance improvements. IIRC WALWriteLock is
one of the main bottlenecks on OLTP workload, although my memory might
already be out of date.

I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
issue - the problem is that writing the WAL to persistent storage itself
is expensive, and we're waiting for that.

So it's not clear to me if removing the lock (and allowing multiple
processes to do pmem_drain concurrently) can actually help, considering
pmem_drain() should flush writes from other processes anyway.

But as I said, that is just my theory - I might be entirely wrong, it'd
be good to hack XLogFlush a bit and try it out.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#54tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Tomas Vondra (#53)
RE: [PoC] Non-volatile WAL buffer

From: Tomas Vondra <tomas.vondra@enterprisedb.com>

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with a relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like we have for insertions, i.e. a handful of locks
allowing a limited number of concurrent inserts.

I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
issue - the problem is that writing the WAL to persistent storage itself
is expensive, and we're waiting for that.

So it's not clear to me if removing the lock (and allowing multiple
processes to do pmem_drain concurrently) can actually help, considering
pmem_drain() should flush writes from other processes anyway.

I may be off track, but HPE's benchmark using Oracle 18c, placing the REDO log file on Intel PMEM in App Direct mode, showed only a 27% performance increase even compared to a "SAS" SSD.

https://h20195.www2.hpe.com/v2/getdocument.aspx?docname=a00074230enw

The just-released Oracle 21c has started support for placing data files on PMEM, eliminating the overhead of buffer cache. It's interesting that this new feature is categorized in "Manageability", not "Performance and scalability."

https://docs.oracle.com/en/database/oracle/oracle-database/21/nfcon/persistent-memory-database-258797846.html

They recommend placing REDO logs on DAX-aware file systems. I wonder what's behind this.

https://docs.oracle.com/en/database/oracle/oracle-database/21/admin/using-PMEM-db-support.html#GUID-D230B9CF-1845-4833-9BF7-43E9F15B7113

"You can use PMEM Filestore for database datafiles and control files. For performance reasons, Oracle recommends that you store redo log files as independent files in a DAX-aware filesystem such as EXT4/XFS."

Regards
Takayuki Tsunakawa

#55Takashi Menjo
takashi.menjo@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#54)
Re: [PoC] Non-volatile WAL buffer

Hi Tomas,

I'll answer your questions. (Not all of them for now, sorry.)

Do I understand correctly that the patch removes "regular" WAL buffers
and instead writes the data into the non-volatile PMEM buffer, without
writing that to the WAL segments at all (unless in archiving mode)?

Firstly, I guess many (most?) instances will have to write the WAL
segments anyway because of PITR/backups, so I'm not sure we can save much
here.

Mostly yes. My "non-volatile WAL buffer" patchset removes regular volatile
WAL buffers and brings non-volatile ones. All the WAL will get into the
non-volatile buffers and persist there. No write out of the buffers to WAL
segment files is required. However, in archiving mode or when the buffers
are full (described later), both the non-volatile buffers and the segment
files are used.

In archiving mode with my patchset, each time one segment (16MB by
default) is completed on the non-volatile buffers, that segment is written to a
segment file asynchronously (by XLogBackgroundFlush). Then it is
archived by the existing archiving functionality.

But more importantly - doesn't that mean the nvwal_size value is
essentially a hard limit? With max_wal_size, it's a soft limit i.e. we're
allowed to temporarily use more WAL when needed. But with a pre-allocated
file, that's clearly not possible. So what would happen in those cases?

Yes, nvwal_size is a hard limit, and I see it's a major weak point of my
patchset.

When all non-volatile WAL buffers are filled, the oldest segment on the
buffers is written (by XLogWrite) to a regular WAL segment file, then those
buffers are cleared (by AdvanceXLInsertBuffer) for new records. All WAL
record insertions to the buffers block until that write and clear are
complete. Due to that, all write transactions also block.

To make matters worse, if a checkpoint eventually occurs in such a
buffer full case, record insertions would block for a certain time at the
end of the checkpoint because a large amount of the non-volatile buffers
will be cleared (see PreallocNonVolatileXlogBuffer). From a client view, it
would look as if the postgres server freezes for a while.

Proper checkpointing would prevent such cases, but it could be hard to
control. When I reproduced Gang's case reported in this thread, such a
buffer-full condition and freeze occurred.

Also, is it possible to change nvwal_size? I haven't tried, but I wonder
what happens with the current contents of the file.

The value of nvwal_size should be equal to the actual size of nvwal_path
file when postgres starts up. If not equal, postgres will panic at
MapNonVolatileXLogBuffer (see nv_xlog_buffer.c), and the WAL contents on
the file will remain as it was. So, if an admin accidentally changes the
nvwal_size value, they just cannot get postgres up.

The file may be extended or shrunk offline with the truncate(1) command, but
the WAL contents in the file must then also be moved to the proper offset,
because the insertion/recovery offset is calculated by modulo, that is,
record's LSN % nvwal_size; otherwise we lose WAL. An offline tool to do
such an operation might be required, but does not exist yet.
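
To illustrate why a plain resize is not enough, here is a minimal sketch of the modulo mapping described above. It is not code from the patchset: the function name nvwal_offset and the LSN and sizes are hypothetical values chosen only for illustration.

```python
GiB = 1024 ** 3

def nvwal_offset(lsn: int, nvwal_size: int) -> int:
    """Byte offset inside the nvwal file for a record at `lsn`.

    Hypothetical helper: it only mirrors the rule 'LSN % nvwal_size'.
    """
    return lsn % nvwal_size

# Hypothetical values, for illustration only.
lsn = 5 * GiB + 4096
old_size = 4 * GiB
new_size = 8 * GiB

old_off = nvwal_offset(lsn, old_size)  # where the record was written
new_off = nvwal_offset(lsn, new_size)  # where recovery would look after a resize

print(f"offset with old size: {old_off}")
print(f"offset with new size: {new_off}")
print("record must be moved" if old_off != new_off else "record stays in place")
```

With these numbers the same LSN maps to two different offsets, so after a resize recovery would read the wrong bytes unless the contents are relocated first.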

The way I understand the current design is that we're essentially
switching from this architecture:

clients -> wal buffers (DRAM) -> wal segments (storage)

to this

clients -> wal buffers (PMEM)

(Assuming there we don't have to write segments because of archiving.)

Yes. Let me describe the current PostgreSQL design and how the patchsets
and other work discussed in this thread change it, AFAIU:

- Current PostgreSQL:
clients -[memcpy]-> buffers (DRAM) -[write]-> segments (disk)

- Patch "pmem-with-wal-buffers-master.patch" Tomas posted:
clients -[memcpy]-> buffers (DRAM) -[pmem_memcpy]-> mmap-ed segments
(PMEM)

- My "non-volatile WAL buffer" patchset:
clients -[pmem_memcpy(*)]-> buffers (PMEM)

- My other patchset mmap-ing segments as buffers:
clients -[pmem_memcpy(*)]-> mmap-ed segments as buffers (PMEM)

- "Non-volatile Memory Logging" in PGCon 2016 [1][2][3]:
clients -[memcpy]-> buffers (WC [4] DRAM as pseudo PMEM) -[async
write]-> segments (disk)

(* or memcpy + pmem_flush)

And I'd say that our previous work "Introducing PMDK into PostgreSQL",
presented at PGCon 2018 [5], and its patchset ([6] for the latest) are based on
the same idea as Tomas's patch above.

That's all for this mail. Please wait for the next one.

Best regards,
Takashi

[1]: https://www.pgcon.org/2016/schedule/track/Performance/945.en.html
[2]: https://github.com/meistervonperf/postgresql-NVM-logging
[3]: https://github.com/meistervonperf/pseudo-pram
[4]: https://www.kernel.org/doc/html/latest/x86/pat.html
[5]: https://pgcon.org/2018/schedule/events/1154.en.html
[6]: /messages/by-id/CAOwnP3ONd9uXPXKoc5AAfnpCnCyOna1ru6sU=eY_4WfMjaKG9A@mail.gmail.com

--
Takashi Menjo <takashi.menjo@gmail.com>

#56Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Tomas Vondra (#53)
Re: [PoC] Non-volatile WAL buffer

On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/25/21 3:56 AM, Masahiko Sawada wrote:

...

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

...

While looking at the two methods: NTT and simple-no-buffer, I realized
that in XLogFlush(), NTT patch flushes (by pmem_flush() and
pmem_drain()) WAL without acquiring WALWriteLock whereas
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
differences between those two methods since WALWriteLock serializes
the operations. With PMEM, multiple backends can concurrently flush
the records if the memory region is not overlapped? If so, flushing
WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

Thanks, that's good to know.

TBH I'm not convinced the code in the "simple-no-buffer" code (coming
from the 0002 patch) is actually correct. Essentially, consider the
backend needs to do a flush, but does not have a segment mapped. So it
maps it and calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

For the record, from what I learned / been told by engineers with PMEM
experience, calling pmem_drain() should properly flush changes done by
other processes. So it should be sufficient to do that in XLogFlush(),
from a single process.

My understanding is that we have about three challenges here:

(a) we still need to track how far we flushed, so this needs to be
protected by some lock anyway (although perhaps a much smaller section
of the function)

(b) pmem_drain() flushes all the changes, so it flushes even "future"
part of the WAL after the requested LSN, which may negatively affect
performance I guess. So I wonder if pmem_persist would be a better fit,
as it allows specifying a range that should be persisted.

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like we have for insertions, i.e. a handful of locks
allowing limited number of concurrent inserts.

Thanks. That's a good summary.

Yeah, in terms of experiments at least it's good to find out that the
approach of mmapping each WAL segment does not perform well.

Right. The problem with small WAL segments seems to be that each mmap
causes the TLB to be thrown away, which means a lot of expensive cache
misses. As the mmap needs to be done by each backend writing WAL, this
is particularly bad with small WAL segments. The NTT patch works around
that by doing just a single mmap.

I wonder if we could pre-allocate and mmap small segments, and keep them
mapped and just rename the underlying files when recycling them. That'd
keep the regular segment files, as expected by various tools, etc. The
question is what would happen when we temporarily need more WAL, etc.

...

I think the performance improvement by NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if we
haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

But we have benchmarked that, see my message from 2020/11/26, which
shows this table:

clients   master/btt   master/dax      ntt   simple
----------------------------------------------------
      1         5469         7402     7977     6746
     16        48222        80869   107025    82343
     32        73974       158189   214718   158348
     64        85921       154540   225715   164248
     96       150602       221159   237008   217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at
filesystem/kernel level, I haven't tried that.

I missed your mail. Yeah, BTT seems to be quite expensive.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit all of its abilities, we might need some drastic changes to the logging
protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM for this
purpose. It may turn out that replacing the DRAM WAL buffers with writes
directly to PMEM is not economical, and aggregating data in a DRAM
buffer is better :-(

Yes. I think it might be interesting to do an analysis of the
bottlenecks of NTT patch by perf etc. If bottlenecks are moved to
other places by removing WALWriteLock during flush, it's probably a
good sign for further performance improvements. IIRC WALWriteLock is
one of the main bottlenecks on OLTP workload, although my memory might
already be out of date.

I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
issue - the problem is that writing the WAL to persistent storage itself
is expensive, and we're waiting for that.

So it's not clear to me if removing the lock (and allowing multiple
processes to do pmem_drain concurrently) can actually help, considering
pmem_drain() should flush writes from other processes anyway.

But as I said, that is just my theory - I might be entirely wrong, it'd
be good to hack XLogFlush a bit and try it out.

I've done some performance benchmarks with the master and NTT v4
patch. Let me share the results.

pgbench setup:
* scale factor = 2000
* duration = 600 sec
* clients = 32, 64, 96

NVWAL setup:
* nvwal_size = 50GB
* max_wal_size = 50GB
* min_wal_size = 50GB

The whole database fits in shared_buffers and WAL segment file size is 16MB.

The results are:

clients   master      NTT   master-unlogged
--------------------------------------------
     32   113209    67107            154298
     64   144880    54289            178883
     96   151405    50562            180018

"master-unlogged" is the same setup as "master" except for using
unlogged tables (using the --unlogged-tables pgbench option). The TPS
increased by about 20% compared to the "master" case (i.e., the logged-table
case). The reason I also tested the unlogged-table case is
that we can regard these results as the ideal performance if we were
able to write WAL records in 0 sec. IOW, even if the PMEM patch
significantly improved WAL logging performance, I think it could not
exceed this performance. But the hope is that if we currently have a
performance bottleneck in WAL logging (e.g., locking and writing
WAL), removing or minimizing WAL logging would give us a chance to
further improve performance by eliminating the newly exposed bottleneck.
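
As a rough check of the headroom this upper bound leaves, the sketch below computes how much higher the master-unlogged TPS is than the master TPS at each client count. The TPS figures are copied from the results table in this mail; the function name headroom_pct is made up for illustration.

```python
# (master TPS, master-unlogged TPS) per client count, from the results table.
results = {
    32: (113209, 154298),
    64: (144880, 178883),
    96: (151405, 180018),
}

def headroom_pct(master_tps: int, unlogged_tps: int) -> float:
    """Percentage by which the unlogged run exceeds the logged run."""
    return (unlogged_tps / master_tps - 1.0) * 100.0

for clients, (master_tps, unlogged_tps) in results.items():
    pct = headroom_pct(master_tps, unlogged_tps)
    print(f"{clients:3d} clients: unlogged is {pct:.1f}% above master")
```

By this arithmetic the gap is roughly 36% at 32 clients and narrows to roughly 23% and 19% at 64 and 96 clients, so the "about 20%" figure fits the higher client counts best.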

As we can see from the above results, the performance of the
"ntt" case was apparently not good in this evaluation. I've not reviewed the
patch in depth yet, but something might be wrong with the v4 patch, or
the PMEM configuration in my environment might be wrong.

Besides, I've checked the main wait events on each experiment using
pg_wait_sampling. Here are the top 5 wait events on "master" case
excluding wait events on the main function of auxiliary processes:

 event_type |        event         |  sum
------------+----------------------+-------
 Client     | ClientRead           | 46902
 LWLock     | WALWrite             | 33405
 IPC        | ProcArrayGroupUpdate |  8855
 LWLock     | WALInsert            |  3215
 LWLock     | ProcArray            |  3022

We can see that the wait event on WALWrite lwlock acquisition happened many
times and was the primary wait event. On the other hand, in the
"master-unlogged" case, I got:

 event_type |        event         |  sum
------------+----------------------+-------
 Client     | ClientRead           | 59871
 IPC        | ProcArrayGroupUpdate | 17528
 LWLock     | ProcArray            |  4317
 LWLock     | XactSLRU             |  3705
 IPC        | XactGroupUpdate      |  3045

LWLock of WAL logging disappeared.

The result of "ntt" case is:

 event_type |        event         |  sum
------------+----------------------+--------
 LWLock     | WALInsert            | 126487
 Client     | ClientRead           |  12173
 LWLock     | BufferContent        |   4480
 Lock       | transactionid        |   2017
 IPC        | ProcArrayGroupUpdate |    924

The wait event on WALWrite lwlock disappeared. Instead, there were
many wait events on WALInsert lwlock. I've not investigated this
result yet. This could be because the v4 patch acquires WALInsert lock
more than necessary or writing WAL records to PMEM took more time than
writing to DRAM as Tomas mentioned before.

If the PMEM patch introduces a new WAL file (called the nvwal file in the
patch) and writes a normal WAL segment file based on the nvwal file, I
think it doesn't necessarily need to follow the current WAL segment
file format (i.e., sequential writes in 8kB blocks). I think there
is a better algorithm to write WAL records to PMEM more efficiently,
like the one this paper proposes [1].

Finally, I realized while using the PMEM patch that with a large nvwal
file, the PostgreSQL server takes a long time to start because it
initializes the nvwal file. In my environment, the nvwal size is 50GB and it
took 1 min to start up. This could lead to downtime in production.

[1]: https://jianh.web.engr.illinois.edu/papers/jian-vldb15.pdf

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#57tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Masahiko Sawada (#56)
RE: [PoC] Non-volatile WAL buffer

From: Masahiko Sawada <sawada.mshk@gmail.com>

I've done some performance benchmarks with the master and NTT v4
patch. Let me share the results.

...

clients   master      NTT   master-unlogged
--------------------------------------------
     32   113209    67107            154298
     64   144880    54289            178883
     96   151405    50562            180018

"master-unlogged" is the same setup as "master" except for using
unlogged tables (using the --unlogged-tables pgbench option). The TPS
increased by about 20% compared to the "master" case (i.e., the logged-table
case). The reason I also tested the unlogged-table case is
that we can regard these results as the ideal performance if we were
able to write WAL records in 0 sec. IOW, even if the PMEM patch
significantly improved WAL logging performance, I think it could not
exceed this performance. But the hope is that if we currently have a
performance bottleneck in WAL logging (e.g., locking and writing
WAL), removing or minimizing WAL logging would give us a chance to
further improve performance by eliminating the newly exposed bottleneck.

Could you tell us the specifics of the storage for WAL, e.g., SSD/HDD, the interface is NVMe/SAS/SATA, read-write throughput and latency (on the product catalog), and the product model?

Was the WAL stored on a storage device separate from the other files? I want to know if the comparison is as fair as possible. I guess that in the NTT (PMEM) case, the WAL traffic is not affected by the I/Os of the other files.

What would the comparison look like between master and unlogged-master if you place WAL on a DAX-aware filesystem like xfs or ext4 on PMEM, which Oracle recommends as REDO log storage? That is, if we place the WAL on the fastest storage configuration possible, what would be the difference between the logged and unlogged?

I'm asking these to know if we consider it worthwhile to make further efforts in special code for WAL on PMEM.

Besides, I've checked the main wait events on each experiment using
pg_wait_sampling. Here are the top 5 wait events on "master" case
excluding wait events on the main function of auxiliary processes:

 event_type |        event         |  sum
------------+----------------------+-------
 Client     | ClientRead           | 46902
 LWLock     | WALWrite             | 33405
 IPC        | ProcArrayGroupUpdate |  8855
 LWLock     | WALInsert            |  3215
 LWLock     | ProcArray            |  3022

We can see the wait event on WALWrite lwlock acquisition happened many
times and it was the primary wait event.

The result of "ntt" case is:

 event_type |        event         |  sum
------------+----------------------+--------
 LWLock     | WALInsert            | 126487
 Client     | ClientRead           |  12173
 LWLock     | BufferContent        |   4480
 Lock       | transactionid        |   2017
 IPC        | ProcArrayGroupUpdate |    924

The wait event on WALWrite lwlock disappeared. Instead, there were
many wait events on WALInsert lwlock. I've not investigated this
result yet. This could be because the v4 patch acquires WALInsert lock
more than necessary or writing WAL records to PMEM took more time than
writing to DRAM as Tomas mentioned before.

Increasing NUM_XLOGINSERT_LOCKS might improve the result, but I don't have much hope because PMEM appears to have limited concurrency...

Regards
Takayuki Tsunakawa

#58Takashi Menjo
takashi.menjo@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#57)
Re: [PoC] Non-volatile WAL buffer

Hi,

I made a new page at PostgreSQL Wiki to gather and summarize information
and discussion about PMEM-backed WAL designs and implementations. Some
parts of the page are TBD. I will continue to maintain the page. Requests
are welcome.

Persistent Memory for WAL
https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL

Regards,

--
Takashi Menjo <takashi.menjo@gmail.com>

#59tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Takashi Menjo (#58)
RE: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi.menjo@gmail.com>

I made a new page at PostgreSQL Wiki to gather and summarize information and discussion about PMEM-backed WAL designs and implementations. Some parts of the page are TBD. I will continue to maintain the page. Requests are welcome.

Persistent Memory for WAL
https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL

Thank you for putting together the information.

In "Allocates WAL buffers on shared buffers", "shared buffers" should be DRAM because shared buffers in Postgres means the buffer cache for database data.

I haven't tracked the whole thread, but could you collect information like the following? I think such (partly basic) information will be helpful to decide whether it's worth casting more efforts into complex code, or it's enough to place WAL on DAX-aware filesystems and tune the filesystem.

* What approaches other DBMSs take, and their performance gains (Oracle, SQL Server, HANA, Cassandra, etc.)
The same DBMS should take different approaches depending on the file type: Oracle recommends different things to data files and REDO logs.

* The storage capabilities of PMEM compared to the fast(est) alternatives such as NVMe SSD (read/write IOPS, latency, throughput, concurrency, which may be posted on websites like Tom's Hardware or SNIA)

* What's the situation like on Windows?

Regards
Takayuki Tsunakawa

#60Takashi Menjo
takashi.menjo@gmail.com
In reply to: tsunakawa.takay@fujitsu.com (#59)
Re: [PoC] Non-volatile WAL buffer

Hi Takayuki,

Thank you for your helpful comments.

In "Allocates WAL buffers on shared buffers", "shared buffers" should be
DRAM because shared buffers in Postgres means the buffer cache for database
data.

That's true. Fixed.

I haven't tracked the whole thread, but could you collect information like
the following? I think such (partly basic) information will be helpful to
decide whether it's worth casting more efforts into complex code, or it's
enough to place WAL on DAX-aware filesystems and tune the filesystem.

* What approaches other DBMSs take, and their performance gains (Oracle,
SQL Server, HANA, Cassandra, etc.)
The same DBMS should take different approaches depending on the file type:
Oracle recommends different things to data files and REDO logs.

I also think it will be helpful. I am adding an "Other DBMSes using PMEM" section.

* The storage capabilities of PMEM compared to the fast(est) alternatives
such as NVMe SSD (read/write IOPS, latency, throughput, concurrency, which
may be posted on websites like Tom's Hardware or SNIA)

This will be helpful, too. I am adding a "Basic performance" subsection under
"Overview of persistent memory (PMEM)."

* What's the situation like on Windows?

Sorry, but I don't know much about Windows' PMEM support. All I know is that
Windows Server 2016 and 2019 support PMEM (2016 partially) [1] and PMDK
supports Windows [2].

All the above contents will be updated gradually. Please stay tuned.

Regards,

[1]: https://docs.microsoft.com/en-us/windows-server/storage/storage-spaces/deploy-pmem
[2]: https://docs.pmem.io/persistent-memory/getting-started-guide/installing-pmdk/installing-pmdk-on-windows

--
Takashi Menjo <takashi.menjo@gmail.com>

#61Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#60)
2 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi Sawada,

Thank you for your performance report.

First, I'd say that the latest v5 non-volatile WAL buffer patchset
does not look bad in itself. I ran a performance test for v5 and got
better performance than both the original (unpatched) PostgreSQL and our
previous work. See the attached figure for results.

I think the steps and/or setups used by Tomas, you, and me could be
different, leading to the different performance results. So here are the
steps and setup for my performance test. Please see the tail of this
mail for them.

Also, I wrote performance tips on the PMEM page at the PostgreSQL wiki
[1].

Regards,
Takashi

[1]: https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL#Performance_tips

# Environment variables
export PGHOST=/tmp
export PGPORT=5432
export PGDATABASE="$USER"
export PGUSER="$USER"
export PGDATA=/mnt/nvme0n1/pgdata

# Steps
Note that I ran the postgres server and pgbench on a single machine
but on two separate NUMA nodes. The PMEM and PCIe SSD for the server process
are on the server-side NUMA node.

01) Create a PMEM namespace (sudo ndctl create-namespace -f -t pmem -m
fsdax -M dev -e namespace0.0)
02) Make an ext4 filesystem for PMEM then mount it with DAX option
(sudo mkfs.ext4 -q -F /dev/pmem0 ; sudo mount -o dax /dev/pmem0
/mnt/pmem0)
03) Make another ext4 filesystem for PCIe SSD then mount it (sudo
mkfs.ext4 -q -F /dev/nvme0n1 ; sudo mount /dev/nvme0n1 /mnt/nvme0n1)
04) Make /mnt/pmem0/pg_wal directory for WAL
05) Make /mnt/nvme0n1/pgdata directory for PGDATA
06) Run initdb (initdb --locale=C --encoding=UTF8 -X /mnt/pmem0/pg_wal ...)
- Also give -P /mnt/pmem0/pg_wal/nvwal -Q 81920 in the case of
"Non-volatile WAL buffer"
07) Edit postgresql.conf as the attached one
08) Start postgres server process on NUMA node 0 (numactl -N 0 -m 0 --
pg_ctl -l pg.log start)
09) Create a database (createdb --locale=C --encoding=UTF8)
10) Initialize pgbench tables with s=50 (pgbench -i -s 50)
11) Stop the postgres server process (pg_ctl -l pg.log -m smart stop)
12) Remount the PMEM and the PCIe SSD
13) Start postgres server process on NUMA node 0 again (numactl -N 0
-m 0 -- pg_ctl -l pg.log start)
14) Run pg_prewarm for all the four pgbench_* tables
15) Run pgbench on NUMA node 1 for 30 minutes (numactl -N 1 -m 1 --
pgbench -r -M prepared -T 1800 -c __ -j __)
- It executes the default tpcb-like transactions

I repeated all the steps three times for each (c,j) then got the
median "tps = __ (including connections establishing)" of the three as
throughput and the "latency average = __ ms " of that time as average
latency.
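
The selection rule above (take the run with the median tps out of three, then report that same run's average latency) can be sketched as follows. This is only an illustration: the function name and the three sample runs are made up and are not measurements from this test.

```python
from typing import List, Tuple

Run = Tuple[float, float]  # (tps, average latency in ms)

def pick_median_run(runs: List[Run]) -> Run:
    """Return the run whose tps is the median of an odd number of runs.

    The latency reported is the one from that same run, not a separate median.
    """
    if len(runs) % 2 == 0:
        raise ValueError("expected an odd number of runs")
    ordered = sorted(runs, key=lambda run: run[0])
    return ordered[len(ordered) // 2]

# Made-up placeholder runs for one (c, j) setting; NOT real results.
sample_runs = [(100.0, 3.0), (120.0, 2.5), (110.0, 2.7)]

tps, latency_ms = pick_median_run(sample_runs)
print(f"throughput = {tps} tps, average latency = {latency_ms} ms")
```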

# Setup
- System: HPE ProLiant DL380 Gen10
- CPU: Intel Xeon Gold 6240M x2 sockets (18 cores per socket; HT
disabled by BIOS)
- DRAM: DDR4 2933MHz 192GiB/socket x2 sockets (32 GiB per channel x 6
channels per socket)
- Optane PMem: Apache Pass, AppDirect Mode, DDR4 2666MHz 1.5TiB/socket
x2 sockets (256 GiB per channel x 6 channels per socket; interleaving
enabled)
- PCIe SSD: DC P4800X Series SSDPED1K750GA
- Distro: Ubuntu 20.04.1
- C compiler: gcc 9.3.0
- libc: glibc 2.31
- Linux kernel: 5.7.0 (built by myself)
- Filesystem: ext4 (DAX enabled when using Optane PMem)
- PMDK: 1.9 (built by myself)
- PostgreSQL (Original): 9e7dbe3369cd8f5b0136c53b817471002505f934 (Jan
18, 2021 @ master)
- PostgreSQL (Mapped WAL file): Original + v5 of "Applying PMDK to WAL
operations for persistent memory" [2]
- PostgreSQL (Non-volatile WAL buffer): Original + v5 of "Non-volatile
WAL buffer" [3]; please read the files' prefix "v4-" as "v5-"

[2]: /messages/by-id/CAOwnP3O3O1GbHpddUAzT=CP3aMpX99=1WtBAfsRZYe2Ui53MFQ@mail.gmail.com
[3]: /messages/by-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg@mail.gmail.com

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

pgbench-optane-pmem-9e7dbe3-s50.png (image/png)
postgresql.conf (application/octet-stream)
#62Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Konstantin Knizhnik (#47)
Re: [PoC] Non-volatile WAL buffer

On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:

...

I have heard from several DBMS experts that the appearance of huge and
cheap non-volatile memory can revolutionize database system
architecture. If the whole database can fit in non-volatile memory, then we
do not need buffers, WAL, ...
But although multi-terabyte NVM announcements were made by IBM several
years ago, I do not know of any successful DBMS prototypes with a new
architecture.

I tried to understand why...

IMHO those predictions are a bit too optimistic, because they often
assume PMEM behavior is mostly similar to DRAM, except for the extra
persistence. But that's not quite true - throughput with PMEM is much
lower in general, peak throughput is reached with few processes (and
then drops quickly) etc. But over the last few years we were focused on
optimizing for exactly the opposite - systems with many CPU cores and
processes, because that's what maximizes DRAM throughput.

I'm not saying a revolution is not possible, but it'll probably require
quite significant rethinking of the whole architecture, and it may take
multiple PMEM generations until the performance improves enough to make
this economical. Some systems are probably more suitable for this (e.g.
Redis is doing most of the work in a single process, IIRC).

The other challenge of course is availability of the hardware - most
users run on whatever is widely available at cloud providers. And PMEM
is unlikely to get there very soon, I'd guess. Until that happens, the
pressure from these customers will be (naturally) fairly low. Perhaps
someone will develop hardware appliances for on-premise setups, as was
quite common in the past. Not sure.

It was very interesting for me to read this thread, which actually
started in 2016 with the "Non-volatile Memory Logging" presentation at PGCon.
As far as I understand from Tomas's results, using PMEM for WAL currently
doesn't provide any substantial increase in performance.

At the moment, I'd probably agree. It's quite possible the PoC patches
are missing some optimizations and the difference might be better, but
even then the performance increase seems fairly modest and limited to
certain workloads.

But the main advantage of PMEM from my point of view is that it allows
us to avoid write-ahead logging altogether!

No, PMEM certainly does not allow avoiding write-ahead logging - we
still need to handle e.g. recovery after a crash, when the data files
are in unknown / corrupted state.

Not to mention that WAL is used for physical and logical replication
(and thus HA), and so on.

Certainly we need to change our algorithms to make it possible. Speaking
about Postgres, we have to rewrite all indexes + heap
and throw away buffer manager + WAL.

The problem with removing buffer manager and just writing everything
directly to PMEM is the worse latency/throughput (compared to DRAM).
It's probably much more efficient to combine multiple writes into RAM
and then do one (much slower) write to persistent storage, than pay the
higher latency for every write.

It might make sense for data sets that are larger than DRAM but can fit
into PMEM. But that seems like fairly rare case, and even then it may be
more efficient to redesign the schema to fit into RAM somehow (sharding,
partitioning, ...).

What can be used instead of standard B-Tree?
For example there is description of multiword-CAS approach:

   http://justinlevandoski.org/papers/mwcas.pdf

and BzTree implementation on top of it:

   https://www.cc.gatech.edu/~jarulraj/papers/2018.bztree.vldb.pdf

There is free BzTree implementation at github:

    git@github.com:sfu-dis/bztree.git

I tried to adapt it for Postgres. It was not so easy because:
1. It was written in modern C++ (-std=c++14)
2. It supports multithreading, but not multiprocess access

So I had to patch the code of this library instead of just using it:

  git@github.com:postgrespro/bztree.git

I have not yet tested the most interesting case: access to PMEM through PMDK.
And I do not have hardware for such tests.
But the first results also seem interesting: PMwCAS is a kind of
lockless algorithm and it shows much better scaling on a
NUMA host compared with standard Postgres.

I have done a simple parallel insertion test: multiple clients
insert data with random keys.
To make the comparison with vanilla Postgres fairer, I used an unlogged
table:

create unlogged table t(pk int, payload int);
create index on t using bztree(pk);

randinsert.sql:
insert into t (payload,pk) values
(generate_series(1,1000),random()*1000000000);

pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres

So each client is inserting one million records.
The target system has 160 virtual and 80 real cores with 256GB of RAM.
Results (TPS) are the following:

N      nbtree   bztree
1         540      455
10        993     2237
100      1479     5025

So bztree is more than 3 times faster for 100 clients.
Just for comparison: the result for inserting into this table without an
index is 10k TPS.
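
The arithmetic behind these figures can be checked with a few lines of Python. This is only a convenience sketch: the TPS values are copied from the table above, and the helper names are made up for illustration.

```python
ROWS_PER_TXN = 1000  # generate_series(1,1000) in randinsert.sql

# TPS figures copied from the results table: clients -> (nbtree, bztree)
RESULTS = {1: (540, 455), 10: (993, 2237), 100: (1479, 5025)}


def speedup(clients):
    """bztree TPS divided by nbtree TPS for the given client count."""
    nbtree, bztree = RESULTS[clients]
    return bztree / nbtree


def rows_per_second(tps):
    """Each transaction inserts ROWS_PER_TXN rows."""
    return tps * ROWS_PER_TXN


for n in sorted(RESULTS):
    nbtree, bztree = RESULTS[n]
    print(f"{n:>3} clients: bztree/nbtree = {speedup(n):.2f}, "
          f"{rows_per_second(bztree):,} rows/s with bztree")
```

The ratios come out at roughly 0.84, 2.25 and 3.40, so bztree is slightly slower with a single client and pulls ahead as the number of clients grows.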

I'm not familiar with bztree, but I agree novel indexing structures are
an interesting topic on their own. I only quickly skimmed the bztree
paper, but it seems it might be useful even on DRAM (assuming it will
work with replication etc.).

The other "problem" with placing data files (tables, indexes) on PMEM
and making this code PMEM-aware is that these writes generally happen
asynchronously in the background, so the impact on transaction rate is
fairly low. This is why all the patches in this thread try to apply PMEM
on the WAL logging / flushing, which is on the critical path.

I am then going to try to play with PMEM.
If the results are promising, then it is possible to think about
reimplementing the heap and a WAL-less Postgres!

I am sorry that my post has no direct relation to the topic of this
thread (Non-volatile WAL buffer).
It seems to me that it is better to use PMEM to eliminate WAL altogether
instead of optimizing it.
Certainly, I realize that WAL plays a very important role in Postgres:
archiving and replication are based on WAL. So even if we can live
without WAL, it is still not clear whether we really want to live
without it.

One more idea: the multi-word CAS approach requires us to express
changes as edit sequences.
Such an edit sequence is effectively a ready-made WAL record. So
implementors of access methods do not have to do double work: update the
data structure in memory and create the corresponding WAL records.
Moreover, PMwCAS operations are atomic: we can replay or revert them in
case of a fault. So there is no need for FPW (full page writes), which
have a very noticeable impact on WAL size and database performance.
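
The idea that a multi-word CAS descriptor doubles as a log record can be sketched as follows. This is a single-threaded illustration under stated assumptions, not the lock-free PMwCAS algorithm and not the bztree code: memory is modelled as a Python list of words, and the names WordUpdate, mwcas, redo and undo are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordUpdate:
    """One entry of the descriptor: change memory[addr] from old to new."""
    addr: int
    old: int
    new: int


def mwcas(memory, updates):
    """Apply all updates or none: succeed only if every word still holds `old`."""
    if any(memory[u.addr] != u.old for u in updates):
        return False
    for u in updates:
        memory[u.addr] = u.new
    return True


def redo(memory, updates):
    """Replay a descriptor after a fault, treating it like a WAL record."""
    for u in updates:
        memory[u.addr] = u.new


def undo(memory, updates):
    """Revert a descriptor, restoring the old value of every word."""
    for u in updates:
        memory[u.addr] = u.old
```

Because the descriptor already lists every address with its old and new value, the same structure that drives the update is enough to replay it or roll it back, which is the "no double work" point made above. A real implementation additionally has to make the descriptor itself durable and handle concurrent helpers, which this sketch does not attempt.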

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#63Konstantin Knizhnik
k.knizhnik@postgrespro.ru
In reply to: Tomas Vondra (#62)
Re: [PoC] Non-volatile WAL buffer

Thank you for your feedback.

On 19.02.2021 6:25, Tomas Vondra wrote:

On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:

...

I have heard from several DBMS experts that the appearance of huge and
cheap non-volatile memory can revolutionize database system
architecture. If the whole database can fit in non-volatile memory, then
we do not need buffers, WAL, ...
But although multi-terabyte NVM announcements were made by IBM several
years ago, I do not know of any successful DBMS prototypes with the new
architecture.

I tried to understand why...

IMHO those predictions are a bit too optimistic, because they often
assume PMEM behavior is mostly similar to DRAM, except for the extra
persistence. But that's not quite true - throughput with PMEM is much
lower

Actually, that is not completely true.
There are several types of NVDIMMs.
The most popular now is NVDIMM-N, which is just a combination of DRAM
and flash. Its speed is the same as that of normal DRAM, but its
capacity is also comparable with DRAM.
So I do not think it is a promising approach.
And Intel Optane memory is definitely much slower than DRAM.

But the main advantage of PMEM from my point of view is that it allows
us to avoid write-ahead logging altogether!

No, PMEM certainly does not allow avoiding write-ahead logging - we
still need to handle e.g. recovery after a crash, when the data files
are in an unknown / corrupted state.

It is possible to avoid write-ahead logging if we use special algorithms
(like PMwCAS) which ensure the atomicity of updates.

The problem with removing the buffer manager and just writing everything
directly to PMEM is the worse latency/throughput (compared to DRAM).
It's probably much more efficient to combine multiple writes into RAM
and then do one (much slower) write to persistent storage, than pay the
higher latency for every write.

It might make sense for data sets that are larger than DRAM but can fit
into PMEM. But that seems like a fairly rare case, and even then it may be
more efficient to redesign the schema to fit into RAM somehow (sharding,
partitioning, ...).

Certainly, avoiding buffering makes sense only if the speed of accessing
PMEM is comparable with DRAM.

So I had to patch the code of this library instead of just using it:

  git@github.com:postgrespro/bztree.git

I have not yet tested the most interesting case: access to PMEM through
PMDK. And I do not have hardware for such tests.
But the first results already seem interesting: PMwCAS is a kind of
lockless algorithm and it scales much better on a NUMA host than
standard Postgres.

I have done a simple parallel insertion test: multiple clients insert
data with random keys.
To make the comparison with vanilla Postgres fairer, I used an unlogged
table:

create unlogged table t(pk int, payload int);
create index on t using bztree(pk);

randinsert.sql:
insert into t (payload,pk) values
(generate_series(1,1000),random()*1000000000);

pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres

So each client is inserting one million records.
The target system has 160 virtual and 80 real cores with 256GB of RAM.
Results (TPS) are the following:

N      nbtree   bztree
1         540      455
10        993     2237
100      1479     5025

So bztree is more than 3 times faster for 100 clients.
Just for comparison: the result for inserting into this table without an
index is 10k TPS.

I'm not familiar with bztree, but I agree novel indexing structures are
an interesting topic on their own. I only quickly skimmed the bztree
paper, but it seems it might be useful even on DRAM (assuming it will
work with replication etc.).

The other "problem" with placing data files (tables, indexes) on PMEM
and making this code PMEM-aware is that these writes generally happen
asynchronously in the background, so the impact on transaction rate is
fairly low. This is why all the patches in this thread try to apply PMEM
on the WAL logging / flushing, which is on the critical path.

I want to give an update on my prototype.
Unfortunately, my attempt to use bztree with PMEM failed because of two
problems:

1. The libpmemobj/bztree libraries I used are not compatible with the
Postgres architecture.
They support concurrent access, but only by multiple threads within one
process (they make wide use of thread-local variables).
The traditional Postgres approach (initialize shared data structures in
the postmaster via shared_preload_libraries and let forked child
processes inherit them) doesn't work for libpmemobj.
If a child doesn't open pmem itself, then any access to it causes a
crash. And if the child does open pmem, it is mapped at a different
virtual memory address.
But the bztree and pmwcas implementations expect the addresses to be the
same in all clients.

2. There is some bug in the bztree/pmwcas implementation which causes
its own test to hang in case of multithreaded access in persistence
mode. I tried to find the reason for the problem but have not succeeded
yet (the PMwCAS implementation is very non-trivial).
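
The address problem in point 1 is commonly avoided by storing offsets from the start of the mapped region instead of raw pointers. The sketch below illustrates that general technique only. It is not how bztree or pmwcas work, and the layout and helper names are hypothetical. Two independent mappings of the same file stand in for two backends that map the same region at different virtual addresses.

```python
import mmap
import struct
import tempfile

HEAD = struct.Struct("<q")   # offset of the first node, or -1 for an empty list
NODE = struct.Struct("<qq")  # (value, offset of the next node or -1)
NIL = -1


def write_list(region, values):
    """Store values as a linked list whose links are offsets from the region start."""
    offsets = [HEAD.size + i * NODE.size for i in range(len(values))]
    HEAD.pack_into(region, 0, offsets[0] if offsets else NIL)
    for i, value in enumerate(values):
        next_off = offsets[i + 1] if i + 1 < len(offsets) else NIL
        NODE.pack_into(region, offsets[i], value, next_off)


def read_list(region):
    """Follow the offset links; valid for any mapping of the same region."""
    values = []
    (off,) = HEAD.unpack_from(region, 0)
    while off != NIL:
        value, off = NODE.unpack_from(region, off)
        values.append(value)
    return values


def demo(values, size=4096):
    """Write through one mapping and read through a second, independent one."""
    with tempfile.TemporaryFile() as backing:
        backing.truncate(size)
        writer = mmap.mmap(backing.fileno(), size)
        reader = mmap.mmap(backing.fileno(), size)
        try:
            write_list(writer, values)
            return read_list(reader)
        finally:
            writer.close()
            reader.close()
```

Since no absolute address is ever stored, the reader does not care where its mapping landed. The cost is that every dereference becomes "base plus offset", which is why retrofitting this onto a library built around raw pointers is a non-trivial patch.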

So I just compared the single-threaded performance of the bztree test:
with Intel Optane it was about two times worse than with volatile
memory.

I still wonder whether using bztree just as an in-memory index would be
interesting, because it scales much better than the Postgres B-Tree and
even our own PgPro in_memory extension. But certainly a volatile index
has very limited uses. Also, full support of all Postgres types in
bztree requires a lot of effort (right now I support only equality
comparison).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

#64Takashi Menjo
takashi.menjo@gmail.com
In reply to: Konstantin Knizhnik (#63)
1 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi,

I ran a performance test in another environment. The steps, setup,
and postgresql.conf of the test are the same as the ones I sent on
Feb 17 [1], except for the following items:

# Setup
- Distro: Red Hat Enterprise Linux release 8.2 (Ootpa)
- C compiler: gcc-8.3.1-5.el8.x86_64
- libc: glibc-2.28-101.el8.x86_64
- Linux kernel: kernel-4.18.0-193.el8.x86_64
- PMDK: libpmem-1.6.1-1.el8.x86_64, libpmem-devel-1.6.1-1.el8.x86_64

See the attached figure for the results. In short, the v5 non-volatile
WAL buffer got better performance than the original (non-patched) one.

Regards,

[1]: /messages/by-id/CAOwnP3OFofOsFtmeikQcbMp0YWdJn0kVB4Ka_0tj+Urq7dtAzQ@mail.gmail.com

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

pgbench-optane-pmem-9e7dbe3-s50-rhel82.pngimage/png; name=pgbench-optane-pmem-9e7dbe3-s50-rhel82.pngDownload
�PNG


IHDR�}R9
;sRGB���gAMA���a	pHYs%%IR$���IDATx^��
�Yy������X1���N��|�q������I���Qp�����?2nH8Q�
$<9i�D��Dg�x��<Q'�0���C�	n��Q�b�v�@2-u�c���}����^U���������9���i���V������u�c�o��o@����i1�n���p�`@0���7��`��
0 n��p�`@0���7��`��
;��}�c���c�=�_��q�7����������7���ny��W����������]���dge�k�Jva��;��;e2�����W0�+�����J������~�E�U5Or�/uI����q�ih�-45��x�oB�D����\�refw'��<3P��%"��ZXmL��w���x
d�us��!����T�4����I������.�D5�-�h(#T�B.�s���#����W�-����+�?q��[�^�p�$�aF���G�m�(��&o|�elr�jE����
��S��;4(2���%�0#�������Z��y��Ud�6����h�SX�}
�UU��']+�[�����RW��������d+�({$���FC~>9�-�|U��+�L"�c��];f��=`�W�qF���Q��#����m����N�����������L
���w6$2��,��T+���m�����Ww�y�Ghi+�pmL���M3���5�eH�>�=
�O��.��D�����kC�S���*�)#������$+||�p������JJ��K��U;�'}�`�W�F�A�K�|K����2����q+J|����W�2�)�V��M*m�.��� �{�����>i�m)Z[b�2$z�Gni���mLj��g����k0�[�[6��G�7^[��HX��������H�_�����iT�8k��?'����,�����h��~��6��J������-C`�W�F�A�K��=�r��>����.�������e��|�����A��(�0��NmB
�'H:���U%�(Crhi���m,����%�y�]��4H�?�������0���V��,i����G�Q���3�5(?55}�t�1�p�G��8����_����d��D����~Eg��#��&j���s�\��������S���p�%`^�A�����L��%����V�"�-�������[Z
�^E���{�Q�������z��<?����y$um��,����V����r_��k�q�K�y/��m�;�
;�>s�o��}��Bk���F�g��y�p�N*�^������L��V�5�"���ldf��L�^������t{?�*��h��.��5� E:�$Y����Jj�MMz�!~4��S*���-�����K|����'�4D�9<FB��&��	��3sz�������Z�g�E�.��T�mTnu������&�G�kp�%h[�
a�$������ J6����!9����!=�1�5�*|��1]�R)�k��p��MK��n8$����r���mN�^��k��f��/=��E(�)���b���q�l�1�C��{!�S6��T���Cj;�v��b�v�3��O]6~E�R��"�}���hC�V�����']������������H����oB��V����?s���-�l.�f�Oi�:
���m@Wt@�r�/�j��@��E�)�������[C�!��e-�V�ONr]������&���&�i�Lm�VC-���~�Db���������;PR\��v}�I���3j�B4[S�����ftQ�����tGLh6-C��rI�?g���2��S��@)s5�(7�=�mA���KM��C[����AK5��+���=~Gio�����A��e-����N|��r����������<:	�"h{
��6�p\�u$.e�O�"�41jXY�R��t�3\���?E����7!�M����N�j�*��������W"$V���n��n����-��"_�a�NI0uQs���n
M�����e������]�-jlu�SIok���oT��.
ZJ�=��7���P��*�UW���_4�R�S�����\���oi��E-P�V*��e����c��zc��o�J����x�mL���9� ���1�R^�����&����nW��E
���lZ��+�� �d}��V�#/���>��[o�H��Jtm1%�#�k}�%���-��\���*���6D���j]R�SC1�P�-"J<�&�����-��zR�K�>�^����+d��
��uU8o����
�-��������E?���]U�0:6���Zo�4��KmafC�M�Gg��K������/����J����H�E�W(��U5j�T�W��������oF���f�8�>��iz�F5M�����K:��+���w���i�jDC�x=C��y��#P����Z��n9���.)F�mLh� �gp��*m�����X[��-z���[���F��x���Gs�O�1���BMmG������"?|��E��������=^jbw�&l����V�]h����r��������+jC���!����C4������O-��������Mx�2^���4C���y+���Em��D����������y��r=����Z
w�Zf�ws�#F�
/5��kHl��&�l4���\�!����_4��b��f���@������Ky�� ���l�k�J�
���m���r,�M��i� �����Aw�u�&��h%�iSl�24�&�26[����3R�k[�EM�n���M���#��c���H��-�"1z���P� �*(��PG�-7h����*��p�5�r�D?���S<��6*��*��}�o�/D���	]��ht5��<�f�G���(2�tL�~|�j���^���jb|��.�S�����H�J��Kj�K
3��r�f�T41e�!�mu�U������e��f���@u���}tC)������Wf���43D#�hzMLt���U�z�0���o	x��4~5|�J�r�o-�&���kA��C/S� �Yz�G��#Q�����
�����~��r���^;:� �A�5�Z��~����6
�[�O���5�i�p����U��)����<�f�n��]
�.6����">n�d��?���hJM��p��(����sNQ���}X��p�M��Q�k�#��h������[��
7�����&>5Fki���E�r@�������	�=M�m���2�voWW���,�t����MhU���f0��������y|���=��/���4��)��P�ejj����m ��^��@�]�� t��ZJ���>3n0�A�w��X>�H���|M���C���)�y��)��z�7��$�[������G�l2d��*~@��F��1 ��|qI��}Y�/k��<�D_��}Qz0~�����wN��
����
�N����SW���[��
7��wS\��mY-����X�=�����������]]%y������	��������o�)���H#�<�_���h
^O-h�qh���^x
3T�l6�d�W�|�k����?@���#����,H��m��yG
����������G��M�&t��q�A���[c�������-W�=M�����Q��k� Y�yjg,4��������W����liH�
����
�N����)A<����U����[��
7�&i.��V��M|�p��m���@u�������������>�voWWI�\�`����v�k�����H����RZ�v���~_��fB�u.�5�?��6E�o	B����f(�w���M����������r������e���)�D���t��}��-Wi��;v�8��	]�F�?e�V��bn��a��Ej��������R����1��M�a4����_M�D%�[.�>��f�hb4D����6��n�*���/��{���t�^t�cU�����	�*[j[dt�����<n���k���hs���6���G
�oK��D���ej���)��"j�{t������o������L�_W��,�{|C��:��O���]����p�V|��
��Y�M��*�*��~�I����G���n2|���$���{��:5�v��0��nj���g��M�OOl��i���+t��U�]nH"�������$�Q|���Jl����X�&{.�jcn�A3�n�0��������PjG���%����o�S+�DC���ek?�B����K*�qnj�@�(!1�s�wY4������|��'e�G���	� ��_���������4�7���������c�Q�tQ]�/���6�&�r]U+�rI�nB�t�k����Jd�]R����z<�H�BM�4'M�p��I0/R�D�9
+~�nhD����Q��������-���n�����y�E�y�������ww���-�@
������������-�toWWI�����.�n�Ls���0B�T�@7������0h���@S��f�U�%��	�����u�����q�Ga�����%C��o��4h��%����)�P�$J��
IB�~eC����+�j#�����#�o�-W�U5a�H�	�\OE��
�7�Mh������������#���4�'���`�+����I�I���/.i�M�?p��pp����
BS�?���C����+��*��G�b�)V��"�K-1��y�E����z���:d���UMzicn�A3�D�7�I�]�rI�vuUXU�M7�$� 2�W	�U���k<�_����)��wj��/.��A|A2x�"O��M���B���.�qI%���oj��-H~!�,�������������������W���*)��H2hb���nzB���|+���Zo�>�D�Y��4�������(�,�c��u�|w�J��\EW5��cJ����*o�9�)�#���M�G�E���_%D�����0,E���Z����"���aMo���/=�^��-�/�M�|%�v5W�h��[��W��X�!�o��*��m�-7�f��Qd[��Up�%���U��-���tS�]��D��`h�����7J�� ztC�J4�*�h�^tGI��������a{-e� D[u"��2Dv<z��P����LU�GZ7�9�������Ti�������a�{����]��)�}f��_�R���N9�nBV��*ZP����B=����8e���d�Gn9@�����4M|����q�nK�]���X���(��z`F�K�h���w=��|Oi�i3._Im��W���r�J��*�_4�J����^z����r@�l���"e�6���f��Z�D�v�:"M����Hms�$J�]��6~u%Cw���*i*Z��io�M�8:�I����f�9Oy�m����Z�����t$&��0����IAbU��G��<m�L1u4b�������N.�M�:RH�q�Udt��������4��8h���}���o����</�v�s6K���]�X�������=��\�<���,>�.����M7D7����n6���`b��������R��C�v�RW�<���owf���pB�yt �_3������z���B~���\�v:@���]n9��
�s��	�{�wL0��0����Q��VC&�cPt-0:��{�����f���!���;~�/��t0��0���K��r�����=��+=���	��n��9�������r��r~0�����w��.=�-C�`���<���DV�&;���]�	�����<����_���
6V�y�����{~`��{$�2��ke���.�����7�Q�f��c���+�K���F��`&�Cp�7n��p�`@0���7��`��
0 n��p�`@0���7��`��
0 n��p�������=���K������(Y[[��rk����xV�pO�vA�����b�\����(��������~W�n���ve��.i��S�����������^z��k����uC�>y���GH�Pcu�poMlm��n���yN�#$f��2�{J��n����<Bb6�:;�p�nu����<�����N7�����
R�����3�
���GH�����!"O~��~�}��l�	h��6h3�s�>�����v�����;���m!���w�o��m������l��A���N��
w�
��S������m����l��F����t��j�[2V�sJ!1�"`�
0_�@�
���VA���N�cw�M�dm\���GH��n��6�fm6��F�`4�S�����Xw��>w�x�{����:\q��'��Z���S����`�
0_�@�
���VA���N��{����3�[j�kWxl�S�����$%��������
��@��l�4
���vI��m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�S������X_�3amm}s����,�K�J��$ ��Kh��6�j#h0��W�pom���[�|��F����S-�����$�l0n��6�fm6��F�`4�S���vn7<���:o���z�Pb�&s��6�|im6�f#[m
FC:�
��M.Q��jx��k��5������
��@��l�4
����6c��tm���R(�f����p`���6h���6��!�:��b�g�anq��B�5������
��@��l�4
���n���~7�������M���,�X�Q�|`�
0_�@�
���VA���N�"�;v����CW�'���X)%�l0n��6�fm6��F�`4�S�������������{N�y���� ��v��w�5�-h!WYb��
h�3�rn�h��6h3@�`4�S���n�Z����GN��Om`�
0_�@�
���VA���N���n1�m6��z�����
����
��@��l�4
��+`����8^=U&����H������-����
��@��l�4
���`���u��1��4g���H�re�����$1���Kh��6�j#h0��W�p����������U�����PoI�J��Z������`�
0_�@�
���VA���N�*�{���6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��W�pomn�����������tP���hK��$ ��Kh��6�j#h0��W�pom���[�|��F�������[^�t
k6	���Kh��6�j#h0��W�p;��RvI]�ws�����X�M�<`�
0_�@�
���VA���N��;��'������k]�s��5������
��@��l�4
����6c��S��Z���X�U�<`�
0_�@�
���VA���N�������g����pW�l�rb�fs��6�|im6�f#[m
FC:u��[mm�����"��#�X(�[�qb��H�m����l��F����t�1����mg�+L=�P<��n�g����4�p����;���{�{��uE��oO��-���zV�H�3�rn�h��6h3@�`4�S��k��O����'�k6
���Kh��6�j#h0��W�p7}m�V�Kizb����m����l��F����t��
w��m���67\9-�X�I�|`�
0_�@�
���VA���N���Y���������u�~�VUR����m����l��F����t�U0����Y[[�'�	�M�H������k�`v�%���`�
0_�@�
���VA���N��[��H����fH��V�����TQl��)�jv$f3��6�|im6�f#[m
FC:�������
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��F����tjw`�
0_�@�
���VA���N�����Kh��6�j#h0��1�Y��6�|im6�f#[m
FC:5�;0��/m���ld����hH��pg����
��@��l�4
���,�p`���6h���6��!���n��6�fm6��F�`4�Sc���m����l��6Rn���J���/_��F$����`4�Sc���m����l��61�{��=t��x�=%���=*_.\��2�E�q���hH��pg����
��X!m��_?p���k������n{��}W�^u��"����`4�Sc���m����l��6��=Kq�B�q���hH��pg����
��X9m��^�����
FC:5�;0��/m����i��mz=�[1.�����!���n��6�fc���n[Y���6n46
���,�p`���6+����}������P������
FC:5�;0��/m���
i;w���k�n���{���'O�t��"����`4�Sc���m����l��61���$�s�8qb{{�-�E�q���hH��pg����
��XEmb��u��'����`4�Sc���m����l�������{�>|�-/�l�Fc���N�����X_[���-W�n�m����l���'N��.������!��/�������Y�tU�F0��/m���ld����hH���p�on�p�]U���Kh��6�j#h0��{6�niN0�n��6�fm6��F�`4�S�f�7���6��{��;��Kh��6�j#h0���2��n��6�fm6��F�`4�Sc���m����l����7of2�d7��t��
���f�n~`��Kh��Z�N�:�g���?P�6n46
��C�����}�[k2L���������|Qw�i�n�Ne��5�)`>0��/m���ji;r���S�.]r��#����`4�Sd�K�����;��-7�[��n&���$������GKN
wb�����Kh��Z����'C��k�����6n46
�������c�$�b�a�e��]�����"������6�|im6VH��7dt����J�q���hH��p���=U�Z�VeK���o�Lb��	h�m����l���K�.��t��a��T���
FC:�8��4-�����fl�6��z���������
��X!mg�������ny�d7��t��n�T7�����5���r}�Y������������g����.0��/m���
i;y���M9<�D�6n46
��C��O�Tk�����u�:
TNh�k�K@'n��6�fc����q�������ny�d7��t���P<I�`�����&>Pmm�<r���(M��ZsO��p`���6h���6��!�zP���������Jk�E�,�p����;����V[�~���V���Xsf�n97]��@�
� h0��W�p7}m�n�]���0��Kh��6�j#h0��6�[�CJ�Iz;z�������������0��/m���ld����hH��poM�A��a����M��;}qR����m����l���.\�t�-d@�q���hH��p;C*�����M�O/i�9^W���:b���%����]sIb63n��6�fcU��������n!���
FC:�@�������x���������u�5g�rzV���lF0��/m����h�V�2 ����`4�Si�{��y7��6�|im6�f#[m
FC:�@��<���N�m����l��F����t�a���r'��6�|im6�f#[m
FC:�P�[�^U
Q0��/m���ld����hH��pOnd����|�8n��6�fm6��F�`4�Sf�'������p`���6�k�v���c��zD��m�hl0���2����k���[m����6�|im6��v��Is��;���0������m�Cc���N=�������p`���6�k;|���Y�fR���;�D>K���!�zH���N�m����l��m��}2���qC3����������hH��p��\��
����
���\��ls����������������hH���I��������
���\��K�d�=z��K��J�jV[?K���!�z0���S��~�$W�L�p`���6�k;{����N�rI�P�[�j�gI��`4�Sz��L8��p`���6�k;w����/_vI�P�[�j�������u����`4�Sc���m����l��FE[�j������(C(��t�/)�t0��/m���lxm��������{=9�0��1�Y��6�|im6�f�k;~������=���8�}�6�x���3&P
���,�p`���6h���v���������e������N���./��^��P�����
��@����C���������KZP
���,�p`���6h����w�������7��L(��t�^
������]�ZJ��Kh����o��o;v��[��Z������%�.�'���e@c���N���^w�d;�3_�@����:t���#n!3jq�|�������pa	�����`4�S�v��������Nb�
0_�@�
���i;���[��I~p	����hH���p�"`�
0_�@�
���i;y���//
����%����N�����Kh��65m��;�J��3�]�����hH��pg����
��@�������C��^�C�(��tjw`�
0_�@�
���i��w��������KZP
���,�p`���6h�j�v���={������}�\j<����'��_��	FC:5�;0��/m��F���){?��?���#�����w~Ua�?r�����>����}�e��%��K�}������@c���N�����Kh�����'O��=}��[��0n����������wn��>x��G�y�t�������Snu��tjw`�
0_�@��<�9rD,�[��V��a����{�_�������/}��5�����}m	�
FC:�@�{scmm}cw��f0��/m��F�����'�����[��0nG�u���C.�>���^>[3��q�[���hH��pO��^o�w7n��6�f#Cm7n��!v��������p��c.�n>��k��������=��#��A+'���q��������]�t��\�SuI����Z���,x�{kSjtU����U�*�X�Y@
n�
�fmsq��%��;�q����?��/-�Ojb/<���~����k�����g�G��}��� �S}
w��Ny�`W��W_���M�WK��a�����p�V��6h���g��8t������>��������p�|b�
y����:�����������u-�������:)L��4\j�r�W�@�TR�����������4�t��"���S�.�{,��Z��X�U@:n�
�fmsq���#G���]������|���{��po���?U��~�������a�����QD��������r:�W��3�k5��t����;�����p�l�<��V����A����ziF��S�:R�G
p���x�������m [a�l���Jh�g~�{��0��>����G�{���~�[c�w��C�y���#�=zT$]�pAS���m�R�s��`��Jf��2����I�i=MR�2�k�#����6c��St}���X�U�<`�
������@�
����2b~�e�k&{}��%�y��_�d$��_?p��P�������X���w�j���N���f�����=)�n��t����9)��m�����p��9�=��D�y����CI�H�X�Q�\`�
������@�
�����C<�;D��n{�d����v���S@�����9���m.�������V ����C#�zH������_>Rd�]k��Hd�h]���f#��B�y�#�f����p�V��6h������?�C�=�������z��{n���}n������S+�#�cc���H������Zk]!���L�9]�i�k���N=��vft���$J���=x��dr_���*K��*`.0��&���l��M�	�?(_���C�=�<�A1�/{��n9���s��7~�7��P������gR��Ij���hG��#Ld�B��z�V�8K�H��p��(�I$�8�0�G][��NWY(�f����- ���/?��n!{���-�E��p�g~������\����}�[n��#��u��}�so��V�"�����[��]���w��)�p��x�]��Vd�s��i�;Ni���5�(&��9f������8K�H������F��6`��y.i)�J��(`.8�m [a�l�-�����.��?n�L�{���3�}?�[y����}��t�U���/��/�_������[���K�*�9��p�h�K���^���?�k7�O%L�$,�F+mE��q�N=�5�[��VT�s�����g$-�B��X�Q�\`�
d+L@�
��s�D��7}G�o���.���������*�PB������\��:u�����O��|�e�H_�+"��kr����
LJ�*��L�������W����ff�my���{��R�N=���m?Z���GM�o����KM�91�"`�
d+L@�
���_�>���=�C��g�����Y�C����W�;&J.^�(��_�.����O5q�'�aN7A��C��H����9��o����k���\}$�|�P����ko
EI���L�'=H�'�3�]
)��r�������N�Sz�d�Q�� G�D��k���fJ�XR����m [a�l��F��>,������{��p��/}��n����q��A�5�y�)���t���i��i&���}��+�
�w�p�N����H"�?�!�G!��h�dI��V�<�2�J��UG:��7Mv���%�j�4���N:k,��ep��W����$1���l�	h��6�k�������!����N1��?x�[nA��w�^��4�ksK���\�Q�Y���j���+�����Tr�+X��j�K<�^���d��D��X9G����f�Hk��,�diV��c^d*�#������C:�R
w���R���h-%��������&+z����������`�
d+L@�
���\��35��k�wL���=���;&=W�^%�����U8�8�S�)%�Y&�=bV(��d��������t�lL2iE���{ |I�n���:�%�-���G�se���8������B�����
���� ����t���6�;�p�V��6h���6��?��?Y�a����_/�����r�-\�xQ�?~\��-2���AJS"�x��b�^3���oV�~T�0V*|ag��=���3���V�n�4FeB�}krl��-����g����E��'�$���6.���.&$�4�����oe]U_�QzD��n��6�.&|�O�p�V��6hKa{{������3��������P�7v>���'��m?���s���Y<v����sm0K������9���Nsf$|a�bv��F�C�Fi������m��:��$���?(�Y�RyX,8�>�����A����j�=���'��\������
���N=����B��]�Q�M���
�p�V��6hKA���g��q���M�	��W~ka���� �����~�{�����'h3��
5w����Z��C����^�mOb;��Ns����5��&�Z�m��K��|�������f�5Z����u�p��,����pGt
]%�7�Sg���
�N�R����@����@[
��=�n!��?~\��K���E��?6>�^1��z�>���\A�i�,F'n�{8�&��N��B5����:�_�-�/J�aB�V�9���].�U"\�����D�|�X�����8�����k�����e�UT}eW	�
��C�i��\&��im����l�	h����{���y�T��~�[
�}��[���������cp�i���y���L�E�T���B��x��XC@���s��[#�n������������(�P�]��!8emw-0��"�zp�
)`�
��t�fm)���W����{��}"��w���p��[����sb�~�Q�����w�zXe�&�Q���[�����6�5!��%!�*^y�pw�����	H��ZMG�F��z��NS"���n��6�f#m����������������	���{]j�����N�����I�6E�_�=�%�-8�8�RN��y�f���Vn�lf�N)�n5��X��{v=�67n�j:�m�*U	,�t��V;.`�
0_�@��L��?��<\/�����o���v[a���A��}��b��<����q�Fxo����$�-�������6�-�{���6/[O�e����������U�+/=������M�' VM��<����>��5�
���N=���^���1�.0��/m��F&���b��!�����OE�O����p�y�����1�������_)��;E[>�I,\_���p����l=]]r��hV�\�RO�4��'����p�R���`R�nmf��F���r�������N=��v����m����ld�m{{���K�S�9�m}���~�{
���'\j���,��>��n��������Hc�	������,�_*��e���'��On0\
R��Y�h�\�R0���8�I���7\�P0�^e�],�������I���^�0��/m���l|�w|�L�������!�/|�
)wL��qCd��v���(�*����\E�E��6���,}�2��jag�����&E���h�]O�H�fj�����c3�*�=<��2�z�;�`�
0_�@�
����o�F�>���0�}?�{��'�s��b���Kj����"����n���i�!9��R��=���4��d0��/m���l<��������������!�?����>����^�~�����GmH������8�p|`�j�!�z�k����G�^0��/m������3_q�0�}?��>��b�_��_v����9���>����Qk-�kJ�W�,�%;���]�t��n��H~8���6�|im6�f���+29r�p�<��U��'�{���u����ZD�[�����s�/��]��<���������,<w+.'`�
0_�@���]�x��@@%���?^&����?P�����y1����G�r;b���;�Jhl0���ir&��r`�
0_�@���8p |��'���:uJ����)���;&o>ay�
�
FC:5�;0��/m��F�n��q��5��m��]3��+w���;&���������`4�SxI	���6�|im6�f���Cb�?���g��~�[>��b����oq�s����N�����Kh��6{���������p��n�c��?�n�<'P
��#nn���m����l�m^�����zkq���������I��|���P
�����%�2�:�WlK�:O�p`���6h�}���_����$b��~������;&��?�c9�0���3����S���k`�
0_�@���h������n�d{{[��[��g����?�C?T<����p_��g�m�zy�u���]�r�hl0���2��nO�1_�$��
����
��_���{��S�]�{�����\�p�e*�3n'O��g_�w���}?���G?(��e��-�s��E�!qs�hl0��2�uC]7����O�p`���6�k�~���d��g}�����w��U��$���3��_O����p����b�_{��n�}����'��`4�Sd��;n�9�=�m����l,E�����������E��?w�0�}?��e�Y�=�~�-�������3Rhl0��1�Y��6�|im6��-��Q�-�7}&����W���"o�_�[/��}���u����_�_)J����t�!/)	��n��\R2�m����l,K��n[���������*��}�_a��m������������un�������hl0����i��g�\�n]�����6�|im6��-t�J�sg7}&��c�~��/.�#}qz��K�w�[nG�7=|��[���hH��p;�]�����V�o�"_5��IB0��/m���������o�������0n����'O������0����V��O��b�O�~�����W�_�l���!�z8�-l�����������Kh�1�6��z����zn1�.SI�q�g�?��{va�{E��|�'~�-����hH��p���\_������+&`�
0_�@���h�Z��O<��'��+�W�0nzn����
���.�'^p���p_{�Sn��
2��`]_�1��N=�����k���������*���X����m����l�m.�� �>������7��N�c��d�|��\���f<w�=�?#������x�F���tW�O|����p�S��j;H��$`0��/m�����>�����>w�[����G�p�=x�-[��fGyZ�M����n��?��3I$�J735OJj�l:?2},E�-_W��47�t,��2�?{�S�N=����y��s��qwI����n��5�����
��@[:�L�C�=�������!�~��W=r�[���
7����1� ���o����feC�DtRH%������
�y��
���W��'��b�(���J#�zY�;�1:\��U�tm�,��������m����l�-�K�.��x�����b�{}����-b�����r;��E�����p@���3�;
�t*j�ZoM����L;"	'�E�F%m����7����Te���l�A�X�c6,���z�{�!W�������2C�k6���Kh�1��.9r$�<�&��M��r�����<Z�^���� ����u����RD�[�Bc[����'��i�t��2_bn�J�Z����iN�:���L��KC���(�Yp�]��H���pK�+rK
&�SG��yD���?�ll��K��,`0��/m�������������l���v��	��?���:X����������s����%��OJ�t��[�Bc[�c��s��=���_�$�N��s�[?����"���Ib�&����g����E���Ih>�x�8���-���g��	�Z�h���S}M��o<�����,���m�6�S�k�5��Da=����bBT��4b�vs��6�|im6���:�bX��������>�q1�/|�
n�1�"����E�����P���In��;A}�-�������\��*S��z�mb��wb�H���X prK�;��� D�T����:�Z��y�u���D�7�VjW#���KJ�6
vqV�����~�ZMP��=Z�u%����Tdf��x���/c����{�m?��ou�}�W.������[n����hx�s���W
77����>y��^>��In�b�f�z����5=��1{�$�������4�M�D��!�B��qIu�OUy"v�5����I��o�,�F�Yp�
@�N��k������"�Lb��
h�3�r���fm6������o��������������~�1����_w��\�rEd>|�-7��69u�T1���w��7n�����sq��I��S��6����� �,W1"�,�".��^��!m�44N�M�����2%�v�d�����p-Q'����H�N=��.�~J|��,��[�(�@+s�k�U@n��6�fm����3��p��������c���|�-�s��y�Q{~���>)��4|l�$�\8�_>�3i�7�����99H/'��]���*^y�&���Y�(�P�]�PHqyx�KU=�.ym���RUx�x�7���]�t��w����t<+�k�U@n��6�fm��3�?����?��n��l?��s��Sw��.\mwL
P���7nh�?�m���m���3T
��Fz��$�U�+/%�
�������D������T�:��H��.C:u�����ij� ��V(�����6�|im6��v���k�*ng�����x&`���x��X��{4������l�O�b��!��e�ul����^�����h���f����r��L�:��b�*���t��bB=��KI�o�-i$#l"m#���C���9�Qj#�h���d��	��t�y�~�%lI5'g[��Kh�1�6=wx��A��LVq�g^�p��xL�'6��X��?�1���������#��0���"�����gS"+5�9���uY���!��3�U�+/M������M�' V��mty��=��*ca-;���N=��v����C>E���gj��ds�4�yw���L�X���5�$f3��6�|im6F��O~X�3�z��+W���Cb�?z��ba^��}b�7>�^��t�|�3l�oM��W�luzo��������_�YU�r�J����xN�v=!Z��#�X#��	(��7�����m�6�Se���)�2����[��X����j�g�'
>8��)R)S���dV���lF0��/m����������jY�M/��y���m����=x^��?�Q��t��P����'K����y�f��a�����	4��+/
��_L�JU�	8av=!���Tcy4����3K�dd@�k�MDwm�#�z �]?Z�rp������`�
0_�@�������3g���y�'n��e/�g��Mq�O��@��������������!t����$+�q�4�����+h�;�X�I����*^yYS`Y��&��ou��*2 ����bD�hb�U�&���3B�m�M��
���4�A���W^���1�m����l�������'n�#���?r�0�������������_�����������`����n�@:�����2��khF0��/m������F���Fw�O�.^�(;R<�}�
�}�������)��e��-w"�,�	a��6��2,������f���Q�Sd��(��1�0��/m������\�r�����g��M�	X\s���p�������?��w��N���+2��IG�����5%�+J��tB���Sd�[w��7��3i���L�p`���6hKA�w�����x�0����K�K�w��_�����]�&f>���)��Y^�������v:���t����z�r{�����Kh������W�^��p?�����wL��~�-����<z��[n�
�!�z0�]�U�!�� ����{�p`���6hKa��}bv�k9��p?��k�h��������r�K+�����
�!�zP�
�`�
0_�@�
��D�	���(����{�v��L�crzYK'P
���,�p`���6h����C���^�"��,�w/�^{��b�O�~����C�DFqYK'P
���n}��|����6�|im6��v���c����X~��?�����\/�^����p����r''N��y�@G���N������w�:0��/m�����m���8�,��{"����z�������[�:��tjw`�
0_�@�
���f������Ol��~��_��{�
�!��k���m����l�m5���>���'?$��%�y�[�	(��tjw`�
0_�@�
���nO�Ym�����?�n1��z�>��P
���,�p`���6hk���Y����C��1��O��[�	(��tjw`�
0_�@�
�E�z���=���[m������yN���?��{�
�!���n��6�f�/m�P�9r��5��K���e��V���YX��1��{�t��HTE���'�r't
���,�p`���6}i����|�\,1nj�?��M�\��'���OT�����r'���1'N�p���`4�Sc���m����l�����K����������`�q�|������{o�����+�Y�����G�oy\P�����~����'E��s��r't
��#n�*�
����
���E��c��a:����l8x����/l?�c�
��Kg�������,��>��n���G���
��;�#�hH��pomn��������F���7����Kh���6��p���7o�\�aeYq;{�����C������m�����XP�x���{�-w��~�S�kO���!�z8���x�����u�g���6�|im6����o�����M|��x�=*��O\+�$��H}(�i{����Ol��������{��~�I���#b�����gAG���N=��vv{m}����|�{`�q�U0��/m�����d�s�:��R����?^,�:R������8���}��k����z�K�s���+������q�:�r���9|��[��C=�����ak�_�P��2�z�0�n��2����
�fX���;n��6�fcqm.\���������/����w"�������}�����_}��'���[����������a�4�������V�;tn���Z-��*�q���J��<�x"bm]X+��#T�y��u�!�z �]7��eN��Kh��6O�^���x���+��������/}��5������}>��d�v��\�[3h;z�t���}�\��v�����}k�vV7-��/5g�"m��wG7�^�N�����l�	h��6#k��+�3��������K���z���gk�[>nud>����E������m��qp]�k��2��zyi��3�.���\��Z$Z�q����`I�t��N=�%%A�)Xp(f5����@����@��1�U��|���m���<wsi�����=x���_p��?��\�~����;(��U$+k�-�7�3v_1erl���j����r�c�������8�g7�z���y���'��z��Y�V�W�k${K8;�'���*� #��7M���<`�P���3���6��0m6�fcLm�{%��]\�-N����@����/>�K�f�_�6��p��Y{�������p�t�]nQ�\E���NokLR��1�3��kp�Y���9Qj�Jo��ZV�_9�����;��9���E]��c��������LYd+xRq�wRb6Y]y�q��Y�eM~&l����P=�NQ�GF{}���G�Z����'�BY0����"i�����p(�=9�E��?A���?�x�0��&�����v�����1Z�*�J����f�z�SA��v���7|�}�{���j������>��U1����C�s���w�\�pAS��7���@����pV#�r
4�qp�J3]�����H�3KeZ�d�M������j������c�-M��e
2��y���e�G�m�\�)V�ee��e)�f���3M��n[�n*!���c��'{�vXxi�F�&�r����t������Q����p�V��6;O����e��t��[�q�V�W2��(m����x����/���'��m��_?p���������>����}��u��q.��;:�9�z
u*�Z�����Z���[-���qI�>���	�[��	A��%_�[.R��t���R�E6���2�(R�K�>t���Re5�U��j+"z��@�,"��0���p#P��������
����������m [a�l�<m���9r��5�<��mz���}�{�y����}�������k��%��=�W}�W��l�-�y��naM�`�����H�����YB2w��x��G����-=�9wG��Q���>V"
K�������`C	����nN�����]�SV����h�!�eJ"�m-�F�@M�PM�o�Yj7!�zx�����RoSq�q.h���0����.�l����*�J���=�\�nl?�O)���{]�,����+�b�<��.an��a]���'��i�w-ln�
���p�G?B-E?m���+������I�����lmm��?�����Z�����l�C�	�T��E`���J�_���Ei�L��d�T�B�@SA%�m�3P�	W�p��g�H�KI��%����Y`�
��t�fm6F�6�W��-�Q������I���z�[N����_��_�C�7�7����0�
a�����v��+�Q�?o�s/V����Pj�{���]v�0&�u�;�������]��=��2������M���*���hI��lY �__�:��l�G5O��":�H���J�.�0+;p2�����3�%�n]u�.��%�8�"��$�����`�
d+L@�
��Z[�^����GI���������������Y��~���/����g|��}�|	���3�uanP�R�����,��x�N�Vg����d��T�]�8<���R�-}6���S�����09.������W[���U����x��<)E���rUQA�N���/���J�@SC%Ef���B�!
w���A��m=�S����f�����A��$����0��&���l��r�d�[��$j{�3��w��~��?��f!��m�r�-��w_xe����[f�a�S��\���4�2�����&8U���i�v}.`s^l�a�i��� m4������5��W[tL������?�G��m������+����J��P�C��*��j������3�����"��R\vU.���V����~�le�FH�yAI`�
#65Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Masahiko Sawada (#56)
Re: [PoC] Non-volatile WAL buffer

On Sat, Feb 13, 2021 at 12:18 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

On Thu, Jan 28, 2021 at 1:41 AM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/25/21 3:56 AM, Masahiko Sawada wrote:

...

On 1/21/21 3:17 AM, Masahiko Sawada wrote:

...

While looking at the two methods: NTT and simple-no-buffer, I realized
that in XLogFlush(), NTT patch flushes (by pmem_flush() and
pmem_drain()) WAL without acquiring WALWriteLock whereas
simple-no-buffer patch acquires WALWriteLock to do that
(pmem_persist()). I wonder if this also affected the performance
differences between those two methods since WALWriteLock serializes
the operations. With PMEM, can multiple backends flush records
concurrently as long as their memory regions do not overlap? If so,
flushing WAL without WALWriteLock would be a big benefit.

That's a very good question - it's quite possible the WALWriteLock is
not really needed, because the processes are actually "writing" the WAL
directly to PMEM. So it's a bit confusing, because it's only really
concerned about making sure it's flushed.

And yes, multiple processes certainly can write to PMEM at the same
time, in fact it's a requirement to get good throughput I believe. My
understanding is we need ~8 processes, at least that's what I heard from
people with more PMEM experience.

Thanks, that's good to know.

TBH I'm not convinced the "simple-no-buffer" code (coming
from the 0002 patch) is actually correct. Essentially, consider the
backend needs to do a flush, but does not have a segment mapped. So it
maps it and calls pmem_drain() on it.

But does that actually flush anything? Does it properly flush changes
done by other processes that may not have called pmem_drain() yet? I
find this somewhat suspicious and I'd bet all processes that did write
something have to call pmem_drain().

For the record, from what I learned / been told by engineers with PMEM
experience, calling pmem_drain() should properly flush changes done by
other processes. So it should be sufficient to do that in XLogFlush(),
from a single process.

My understanding is that we have about three challenges here:

(a) we still need to track how far we flushed, so this needs to be
protected by some lock anyway (although perhaps a much smaller section
of the function)

(b) pmem_drain() flushes all the changes, so it flushes even the "future"
part of the WAL after the requested LSN, which may negatively affect
performance I guess. So I wonder if pmem_persist would be a better fit,
as it allows specifying a range that should be persisted.

(c) As mentioned before, PMEM behaves differently with concurrent
access, i.e. it reaches peak throughput with a relatively low number of
threads writing data, and then the throughput drops quite quickly. I'm
not sure if the same thing applies to pmem_drain() too - if it does, we
may need something like we have for insertions, i.e. a handful of locks
allowing a limited number of concurrent inserts.
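
Challenge (a) can be illustrated with a minimal Python sketch. It is a
toy model, not PostgreSQL code: FlushTracker, flush_up_to and the
persist_range callback are hypothetical names, and persist_range merely
stands in for a ranged flush such as pmem_persist(). The sketch assumes
that all WAL below the target has already been fully inserted, which the
real code has to guarantee separately. The point is only that the
expensive flush can run outside the lock while the "flushed up to"
bookkeeping stays in two short critical sections.

```python
import threading


class FlushTracker:
    """Toy model: tracks how far the log has been made durable."""

    def __init__(self, persist_range):
        # persist_range(start, end) is a hypothetical stand-in for a ranged
        # flush such as pmem_persist(); it is not a real PostgreSQL API.
        self._persist_range = persist_range
        self._lock = threading.Lock()
        self._flushed = 0  # everything below this position is durable

    @property
    def flushed(self):
        with self._lock:
            return self._flushed

    def flush_up_to(self, target):
        # Short critical section: read the current flush position.
        with self._lock:
            start = self._flushed
        if target <= start:
            return False  # another caller already flushed past our target

        # The expensive part runs without the lock, so callers can overlap.
        self._persist_range(start, target)

        # Short critical section: advance the position, never move it back.
        with self._lock:
            if target > self._flushed:
                self._flushed = target
        return True
```

Each caller persists everything from the position it observed up to its
own target, so the recorded position never claims more than has actually
been flushed, even when concurrent callers flush overlapping ranges.
Whether overlapping flushes are cheap enough on real PMEM is exactly the
open question in (c).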

Thanks. That's a good summary.

Yeah, in terms of experiments at least it's good to find out that the
approach of mmapping each WAL segment does not perform well.

Right. The problem with small WAL segments seems to be that each mmap
causes the TLB to be thrown away, which means a lot of expensive cache
misses. As the mmap needs to be done by each backend writing WAL, this
is particularly bad with small WAL segments. The NTT patch works around
that by doing just a single mmap.

I wonder if we could pre-allocate and mmap small segments, and keep them
mapped and just rename the underlying files when recycling them. That'd
keep the regular segment files, as expected by various tools, etc. The
question is what would happen when we temporarily need more WAL, etc.
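
The idea of keeping segments mapped across a rename relies on a mapping
being tied to the underlying file rather than to its name. A minimal
Python sketch of that behavior on a regular POSIX filesystem follows (no
PMEM or DAX involved; the file names, SEGMENT_SIZE and the function name
are made up for illustration):

```python
import mmap
import os
import tempfile

SEGMENT_SIZE = 64 * 1024  # small stand-in for a 16MB WAL segment


def demo_rename_keeps_mapping(directory):
    old_path = os.path.join(directory, "segment_old")
    new_path = os.path.join(directory, "segment_new")

    # Pre-allocate the file so the whole range can be mapped.
    with open(old_path, "wb") as f:
        f.truncate(SEGMENT_SIZE)

    fd = os.open(old_path, os.O_RDWR)
    try:
        mapping = mmap.mmap(fd, SEGMENT_SIZE)  # shared, read-write by default
    finally:
        os.close(fd)  # the mapping stays valid after the descriptor is closed

    # "Recycle" the segment: only the directory entry changes.
    os.rename(old_path, new_path)

    # Write through the existing mapping, then flush it to the file.
    mapping[0:5] = b"hello"
    mapping.flush()
    mapping.close()

    # The data written after the rename is visible under the new name.
    with open(new_path, "rb") as f:
        return f.read(5)
```

Calling demo_rename_keeps_mapping() on a temporary directory (for
example one created with tempfile.TemporaryDirectory()) returns the
bytes written after the rename. This says nothing about the cost of
keeping many segments mapped, or about what to do when more WAL is
temporarily needed.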

...

I think the performance improvement by the NTT patch with the 16MB WAL
segment, the most common WAL segment size, is very good (150437 vs.
212410 with 64 clients). But maybe evaluating writing WAL segment
files on a PMEM DAX filesystem is also worthwhile, as you mentioned, if
we haven't done that yet.

Well, not sure. I think the question is still open whether it's actually
safe to run on DAX, which does not have atomic writes of 512B sectors,
and I think we rely on that e.g. for pg_control. But maybe for WAL that's
not an issue.

I think we can use the Block Translation Table (BTT) driver that
provides atomic sector updates.

But we have benchmarked that, see my message from 2020/11/26, which
shows this table:

clients   master/btt   master/dax      ntt   simple
-----------------------------------------------------------
      1         5469         7402     7977     6746
     16        48222        80869   107025    82343
     32        73974       158189   214718   158348
     64        85921       154540   225715   164248
     96       150602       221159   237008   217253

Clearly, BTT is quite expensive. Maybe there's a way to tune that at
filesystem/kernel level, I haven't tried that.

I missed your mail. Yeah, BTT seems to be quite expensive.

I'm also wondering if WAL is the right usage for PMEM. Per [2] there's a
huge read-write asymmetry (the writes being way slower), and their
recommendation (in "Observation 3") is:

The read-write asymmetry of PMem implies the necessity of avoiding
writes as much as possible for PMem.

So maybe we should not be trying to use PMEM for WAL, which is pretty
write-heavy (and in most cases even write-only).

I think using PMEM for WAL is cost-effective, but it leverages only the
low-latency (sequential) writes, not other abilities such as
fine-grained access and low-latency random writes. If we want to
exploit all of its abilities we might need some drastic changes to the
logging protocol while considering storing data on PMEM.

True. I think it's worth investigating whether it's sensible to use PMEM for this
purpose. It may turn out that replacing the DRAM WAL buffers with writes
directly to PMEM is not economical, and aggregating data in a DRAM
buffer is better :-(

Yes. I think it might be interesting to do an analysis of the
bottlenecks of the NTT patch using perf etc. If bottlenecks are moved to
other places by removing WALWriteLock during flush, it's probably a
good sign for further performance improvements. IIRC WALWriteLock is
one of the main bottlenecks on OLTP workloads, although my memory might
already be out of date.

I think WALWriteLock itself (i.e. acquiring/releasing it) is not an
issue - the problem is that writing the WAL to persistent storage itself
is expensive, and we're waiting for that.

So it's not clear to me if removing the lock (and allowing multiple
processes to do pmem_drain concurrently) can actually help, considering
pmem_drain() should flush writes from other processes anyway.

But as I said, that is just my theory - I might be entirely wrong, it'd
be good to hack XLogFlush a bit and try it out.

I've done some performance benchmarks with the master and NTT v4
patch. Let me share the results.

pgbench setup:
* scale factor = 2000
* duration = 600 sec
* clients = 32, 64, 96

NVWAL setup:
* nvwal_size = 50GB
* max_wal_size = 50GB
* min_wal_size = 50GB

The whole database fits in shared_buffers and WAL segment file size is 16MB.

The results are:

clients    master      NTT   master-unlogged
     32    113209    67107            154298
     64    144880    54289            178883
     96    151405    50562            180018

"master-unlogged" is the same setup as "master" except for using
unlogged tables (using the --unlogged-tables pgbench option). The TPS
increased by roughly 19-36% compared to the "master" case (i.e., the
logged table case). The reason why I experimented with the unlogged
table case as well is that we can think of these results as the ideal
performance if we were able to write WAL records in 0 sec. IOW, even if
the PMEM patch significantly improved WAL logging performance, I think
it could not exceed this performance. But the hope is that if we
currently have a performance bottleneck in WAL logging (e.g., locking
and writing WAL), removing or minimizing WAL logging would bring a
chance to further improve performance by eliminating the newly emerging
bottleneck.

As we can see from the above result, the performance of the "ntt" case
was not good in this evaluation. I've not reviewed the patch in depth
yet, but something might be wrong with the v4 patch, or the PMEM
configuration in my environment might be wrong.

I've reconfigured PMEM and done the same benchmark. I got the
following results (changed only "ntt" case):

clients    master      NTT   master-unlogged
     32    113209   144829            154298
     64    144880   164899            178883
     96    151405   166096            180018

I got much better performance with the "ntt" patch. I think it was
wrong that I created a partition on PMEM (i.e., created a filesystem
on /dev/pmem1p1) in the last evaluation. Sorry for confusing you,
Menjo-san.
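
As a worked check of the numbers in the two result tables above, the
following Python sketch computes the gain of the reconfigured NTT run
over master, the gain of master-unlogged over master, and how much of
the gap between master and master-unlogged the NTT run closes. The
figures are copied from the tables; treating master-unlogged as the
upper bound is the assumption stated earlier in this message, not a
measured limit, and the helper names are illustrative.

```python
# TPS figures copied from the two result tables above (clients -> TPS).
master = {32: 113209, 64: 144880, 96: 151405}
ntt = {32: 144829, 64: 164899, 96: 166096}  # after reconfiguring PMEM
master_unlogged = {32: 154298, 64: 178883, 96: 180018}


def percent_gain(new, base):
    """Relative gain of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


def headroom_captured(clients):
    """Share of the master -> unlogged gap that the NTT run closes (%)."""
    gap = master_unlogged[clients] - master[clients]
    return (ntt[clients] - master[clients]) / gap * 100.0


for clients in sorted(master):
    print(
        f"{clients:>3} clients: "
        f"NTT vs master {percent_gain(ntt[clients], master[clients]):5.1f}%, "
        f"unlogged vs master "
        f"{percent_gain(master_unlogged[clients], master[clients]):5.1f}%, "
        f"headroom captured {headroom_captured(clients):5.1f}%"
    )
```

With these inputs the NTT gain over master is about 28%, 14% and 10% at
32, 64 and 96 clients, so the relative benefit shrinks as the client
count grows.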

FWIW here are the top 5 wait events on new "ntt" case:

 event_type |        event         | sum
------------+----------------------+------
 Client     | ClientRead           | 8462
 LWLock     | WALInsert            | 1049
 LWLock     | ProcArray            |  627
 IPC        | ProcArrayGroupUpdate |  481
 LWLock     | XactSLRU             |  247

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#66Takashi Menjo
takashi.menjo@gmail.com
In reply to: Masahiko Sawada (#65)
Re: [PoC] Non-volatile WAL buffer

Hi Sawada,

I am relieved to hear that the performance problem was solved.

I also added a tip about PMEM namespaces and partitioning to the PG wiki [1].

Regards,

[1]: https://wiki.postgresql.org/wiki/Persistent_Memory_for_WAL#Configure_and_verify_DAX_hugepage_faults

--
Takashi Menjo <takashi.menjo@gmail.com>

#67Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Takashi Menjo (#66)
1 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi,

I've performed some additional benchmarking and testing on the patches
sent on 26/1 [1], and I'd like to share some interesting results.

I did the tests on two different machines, with slightly different
configurations. Both machines use the same CPU generation with slightly
different frequency, a different OS (Ubuntu vs. RH), kernel (5.3 vs.
4.18) and so on. A more detailed description is in the attached PDF,
along with the PostgreSQL configuration.

The benchmark is fairly simple - pgbench with scale 500 (fits into
shared buffers) and 5000 (fits into RAM). The runs were just 1 minute
each, which is fairly short - it's however intentional, because I've
done this with both full_page_writes=on/off to test how this behaves
with many and no FPIs. This models extreme behaviors at the beginning
and at the end of a checkpoint.

This thread is rather confusing because there are far too many patches
with overlapping version numbers - even [1] contains two very different
patches. I'll refer to them as "NTT / buffer" (for the patch using one
large PMEM buffer) and "NTT / segments" for the patch using regular WAL
segments.

The attached PDF shows all these results along with charts. Although the
two systems differ somewhat in performance (throughput), the conclusions
seem to be mostly the same, so I'll just talk about results from one of
the systems here (aka "System A").

Note: Those systems are hosted / provided by Intel SDP, and Intel is
interested in providing access to other devs interested in PMEM.

Furthermore, these patches seem to be very insensitive to WAL segment
size (unlike the experimental patches I shared some time ago), so I'll
only show results for one WAL segment size. (Obviously, the NTT / buffer
patch can't be sensitive to this by definition, as it's not using WAL
segments at all.)

Results
-------

For scale 500, the results (with full_page_writes=on) look like this:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master              9411   58833  111453  181681  215552  234099
NTT / buffer       10837   77260  145251  222586  255651  264207
NTT / segments     11011   76892  145049  223078  255022  269737

So there is a fairly nice speedup - about 30%, which is consistent with
the results shared before. Moreover, the "NTT / segments" patch performs
about the same as the "NTT / buffer" which is encouraging.

For scale 5000, the results look like this:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master              7388   42020   64523   91877  102805  111389
NTT / buffer        8650   58018   96314  132440  139512  134228
NTT / segments      8614   57286   97173  138435  157595  157138

That's intriguing - the speedup is even higher, almost 40-60% with
enough clients (16-64). For me this is a bit surprising, because in this
case the data don't fit into shared_buffers, so extra time needs to be
spent copying data between RAM and shared_buffers and perhaps even doing
some writes. So my expectation was that this increases the amount of
time spent outside XLOG code, thus diminishing the speedup.

Now, let's look at results with full_page_writes=off. For scale 500 the
results are:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master             10476   67191  122191  198620  234381  251452
NTT / buffer       11119   79530  148580  229523  262142  275281
NTT / segments     11528   79004  148978  229714  259798  274753

and on scale 5000:

clients                1       8      16      32      48      64
------------------------------------------------------------------
master              8192   55870   98451  145097  172377  172907
NTT / buffer        9063   62659  110868  161352  173977  164359
NTT / segments      9277   63226  112307  166070  171997  158085

That is, the speedup with scale 500 drops to ~10%, and for scale 5000
it disappears almost entirely.

I'd have expected that without FPIs the patches will actually be more
effective - so this seems interesting. The conclusion however seems to
be that the lower the amount of FPIs in the WAL stream, the smaller the
speedup. Or in a different way - it's most effective right after a
checkpoint, and it decreases during the checkpoint. So in a well tuned
system with significant distance between checkpoints, the speedup seems
to be fairly limited.

This is also consistent with the fact that for scale 5000 (with FPW=on)
the speedups are much more significant, simply because there are far
more pages (and thus FPIs). Also, after disabling FPWs the speedup
almost entirely disappears.
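
The relative speedups discussed above can be recomputed from the four
tables with a short Python sketch. The TPS values are copied from the
tables; the dictionary layout and the speedups() helper are illustrative
names, and the first scale 5000 table is assumed to be the
full_page_writes=on run, as the surrounding text implies.

```python
CLIENTS = [1, 8, 16, 32, 48, 64]

# TPS copied from the four tables above: (scale, full_page_writes) -> rows.
RESULTS = {
    (500, "on"): {
        "master": [9411, 58833, 111453, 181681, 215552, 234099],
        "NTT / buffer": [10837, 77260, 145251, 222586, 255651, 264207],
        "NTT / segments": [11011, 76892, 145049, 223078, 255022, 269737],
    },
    (5000, "on"): {
        "master": [7388, 42020, 64523, 91877, 102805, 111389],
        "NTT / buffer": [8650, 58018, 96314, 132440, 139512, 134228],
        "NTT / segments": [8614, 57286, 97173, 138435, 157595, 157138],
    },
    (500, "off"): {
        "master": [10476, 67191, 122191, 198620, 234381, 251452],
        "NTT / buffer": [11119, 79530, 148580, 229523, 262142, 275281],
        "NTT / segments": [11528, 79004, 148978, 229714, 259798, 274753],
    },
    (5000, "off"): {
        "master": [8192, 55870, 98451, 145097, 172377, 172907],
        "NTT / buffer": [9063, 62659, 110868, 161352, 173977, 164359],
        "NTT / segments": [9277, 63226, 112307, 166070, 171997, 158085],
    },
}


def speedups(scale, fpw, patch):
    """Percent gain of `patch` over master for each client count."""
    rows = RESULTS[(scale, fpw)]
    return [
        round((p - m) / m * 100.0, 1)
        for p, m in zip(rows[patch], rows["master"])
    ]


for (scale, fpw), rows in RESULTS.items():
    for patch in ("NTT / buffer", "NTT / segments"):
        gains = dict(zip(CLIENTS, speedups(scale, fpw, patch)))
        print(f"scale {scale}, FPW={fpw}, {patch}: {gains}")
```

A negative value means the patched build was slower than master for that
client count, which happens for scale 5000 with full_page_writes=off at
64 clients.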

On the second system, the differences are even more significant (see the
PDF). I suspect this is due to slightly different hardware config with
slower CPU / different PMEM capacity, etc. The overall behavior and
conclusions are however the same, I think.

Of course, another question is how this will be affected by newer PMEM
versions with higher performance (e.g. the new generation of Intel PMEM
should be ~20% faster, from what I hear).

Issues & Questions
------------------

While testing the "NTT / segments" patch, I repeatedly managed to crash
the cluster with errors like this:

2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating
logfile segment just before mapping; path "pg_wal/00000001000000070000002F"
2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating
logfile segment just before mapping; path "pg_wal/000000010000000700000030"
...
2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating
logfile segment just before mapping; path "pg_wal/000000010000000700000030"
2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not
open file "pg_wal/000000010000000700000030": No such file or directory

I do believe this is a thinko in the 0008 patch, which does XLogFileInit
in XLogFileMap. Notice there are multiple "creating logfile" messages
with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap
may be called from multiple backends, so they may call XLogFileInit
concurrently, likely triggering some sort of race condition. It's a fairly
rare issue, though - I've only seen it twice in ~20 runs.
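
The cause of the PANIC is not established here, so the following is only
a generic illustration of how a "create the file if it is missing" step
can be made safe when several processes or threads run it at once. It is
a Python sketch, not the patch's code: ensure_segment() is a
hypothetical helper, and the technique shown (build the file under a
private name, then install it with an atomic link() that fails if the
target already exists) is one common pattern, assumed here to run on a
filesystem that supports hard links.

```python
import os
import threading


def ensure_segment(path, size):
    """Create `path` with `size` zero bytes unless it already exists.

    Returns True if this caller installed the file, False if another
    caller won the race. Either way the complete file exists on return.
    """
    if os.path.exists(path):
        return False

    # Build the file under a private name first so nobody sees it half-made.
    tmp_path = f"{path}.tmp.{os.getpid()}.{threading.get_ident()}"
    with open(tmp_path, "wb") as f:
        f.truncate(size)
        f.flush()
        os.fsync(f.fileno())

    try:
        # link() is atomic and fails if the target exists, so exactly one
        # concurrent caller installs its file.
        os.link(tmp_path, path)
        return True
    except FileExistsError:
        return False
    finally:
        # The private name is no longer needed, whoever won.
        os.unlink(tmp_path)
```

Callers that lose the race simply use the file installed by the winner,
so nobody opens a path that a concurrent caller is still creating or has
just replaced.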

The other question I have is about WALInsertLockUpdateInsertingAt. 0003
removes this function, but leaves behind some of the other bits working
with insert locks and insertingAt. But it does not explain how it works
without WaitXLogInsertionsToFinish() - how does it ensure that when we
commit something, all the preceding WAL is "complete" (i.e. written by
other backends etc.)?

Conclusion
----------

I do think the "NTT / segments" patch is the most promising way forward.
It does perform about as well as the "NTT / buffer" patch (and both
perform much better than the experimental patches I shared in January).

The "NTT / buffer" patch seems much more disruptive - it introduces one
large buffer for WAL, which makes various other tasks more complicated
(i.e. it needs additional complexity to handle WAL archival, etc.). Are
there some advantages of this patch (compared to the other patch)?

As for the "NTT / segments" patch, I wonder if we can just rework the
code like this (to use mmap etc.) or whether we need to support
both ways (file I/O and mmap). I don't have much experience with many
other platforms, but it seems quite possible that mmap won't work all
that well on some of them. So my assumption is we'll need to support
both file I/O and mmap to make any of this committable, but I may be wrong.
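
To make the two write paths concrete, here is a minimal Python sketch. It is not PostgreSQL code: the segment size, the file names, and the helper names are made up for illustration. One helper writes a record with a positioned write plus fsync, the other copies into a mapping and flushes it, which is roughly the split a dual-mode implementation would have to hide behind one interface.

```python
import mmap
import os
import tempfile

# Hypothetical tiny "segment"; real WAL segments are far larger.
SEGMENT_SIZE = mmap.PAGESIZE


def create_segment(path):
    """Create a zero-filled file of SEGMENT_SIZE bytes."""
    with open(path, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)


def write_with_file_io(path, offset, data):
    """File I/O path: positioned write followed by fsync."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.lseek(fd, offset, os.SEEK_SET)
        os.write(fd, data)
        os.fsync(fd)
    finally:
        os.close(fd)


def write_with_mmap(path, offset, data):
    """mmap path: copy into the mapping, then flush the mapping."""
    fd = os.open(path, os.O_RDWR)
    try:
        with mmap.mmap(fd, SEGMENT_SIZE) as mapping:
            mapping[offset:offset + len(data)] = data
            mapping.flush()
    finally:
        os.close(fd)


def demo():
    """Write the same record through both paths and return both file contents."""
    with tempfile.TemporaryDirectory() as tmpdir:
        results = []
        for writer in (write_with_file_io, write_with_mmap):
            path = os.path.join(tmpdir, writer.__name__)
            create_segment(path)
            writer(path, 128, b"record")
            with open(path, "rb") as f:
                results.append(f.read())
        return results
```

Both paths leave identical bytes in the file. What differs is how the bytes get there, and how well each path behaves on platforms where mmap is less well supported, which is the open question raised above.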

[1]: /messages/by-id/CAOwnP3Oz4CnKp0-_KU-x5irr9pBqPNkk7pjwZE5Pgo8i1CbFGg@mail.gmail.com

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments:

pmem-benchmarks.pdf (application/pdf)
#68Takashi Menjo
takashi.menjo@gmail.com
In reply to: Tomas Vondra (#67)
Re: [PoC] Non-volatile WAL buffer

Hi Tomas,

Thank you so much for your report. I have read it with great interest.

Your conclusion sounds reasonable to me. The patchset you call "NTT /
segments" performed as well as the "NTT / buffer" patchset. I had
been worried that calling mmap/munmap for each WAL segment file could
add a lot of overhead. Based on your performance tests, however, the
overhead looks smaller than I thought. In addition, the "NTT / segments"
patchset is more compatible with the current PG and friendlier to
DBAs because it uses WAL segment files and does not introduce any
other new WAL-related file.

I also think that supporting both file I/O and mmap is better than
supporting only mmap. I will continue my work on "NTT / segments"
patchset to support both ways.

In the following, I will answer "Issues & Questions" you reported.

While testing the "NTT / segments" patch, I repeatedly managed to crash the cluster with errors like this:

2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating logfile segment just before
mapping; path "pg_wal/00000001000000070000002F"
2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating logfile segment just before
mapping; path "pg_wal/000000010000000700000030"
...
2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating logfile segment just before
mapping; path "pg_wal/000000010000000700000030"
2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open file
"pg_wal/000000010000000700000030": No such file or directory

I do believe this is a thinko in the 0008 patch, which does XLogFileInit in XLogFileMap. Notice there are multiple
"creating logfile" messages with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap may be
called from multiple backends, so they may call XLogFileInit concurrently, likely triggering some sort of race
condition. It's fairly rare issue, though - I've only seen it twice from ~20 runs.

Thank you for your report. I found that it is rather patch 0009 that
has the issue, and that it will also cause WAL loss. I should have set
use_existent to true; otherwise InstallXlogFileSegment and BasicOpenFile
in XLogFileInit can race. I had misunderstood that use_existent can
be true because I am creating a brand-new file with XLogFileInit.

I will fix the issue.
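
The WAL-loss side of this can be illustrated with a small sequential simulation. This is a sketch in Python, not PostgreSQL code: init_segment is a hypothetical stand-in for segment creation, and its use_existing flag only loosely mirrors the use_existent argument discussed above. It assumes that a forced install replaces whatever file is already at the path, while the other mode keeps an existing file.

```python
import os
import tempfile

SEGMENT_SIZE = 16  # tiny stand-in for a WAL segment size


def init_segment(path, use_existing):
    """Hypothetical stand-in for segment creation; returns True if a file was installed."""
    if use_existing:
        try:
            # O_EXCL makes "create only if absent" a single atomic step.
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
        except FileExistsError:
            return False  # another backend already created it; reuse it
        with os.fdopen(fd, "wb") as f:
            f.write(b"\0" * SEGMENT_SIZE)
        return True
    # Forced install: build a fresh file and put it in place of any old one.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(b"\0" * SEGMENT_SIZE)
    os.replace(tmp, path)
    return True


def write_record(path, payload):
    """Overwrite the start of the segment with a record."""
    with open(path, "r+b") as f:
        f.write(payload)


def simulate(use_existing):
    """Backend A creates and writes a segment, then backend B initializes it again."""
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "000000010000000700000030")
        init_segment(path, use_existing)  # backend A
        write_record(path, b"REC-A")      # backend A inserts a record
        init_segment(path, use_existing)  # backend B
        with open(path, "rb") as f:
            return f.read()
```

With simulate(False) the second initialization silently replaces a segment that already holds a record, so the record is gone. With simulate(True) the record survives. The timing-dependent part of the real race (one process opening the file while another is replacing it) is not modeled here.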

The other question I have is about WALInsertLockUpdateInsertingAt. 0003 removes this function, but leaves
behind some of the other bits working with insert locks and insertingAt. But it does not explain how it works without
WaitXLogInsertionsToFinish() - how does it ensure that when we commit something, all the preceding WAL is
"complete" (i.e. written by other backends etc.)?

By waiting for *all* the WALInsertLocks to be released, regardless
of whether each of them precedes or follows the current insertion.

That would have worked functionally, but on reflection it is not good
for performance: XLogFileMap in GetXLogBuffer (where
WaitXLogInsertionsToFinish was removed) can block, because it can
eventually call write() in XLogFileInit.

I will restore the WALInsertLockUpdateInsertingAt function and related
code for mmap.
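
The difference between the two waiting strategies can be shown with a toy model. This is a Python sketch under simplifying assumptions, not the patch's or PostgreSQL's code: WAL positions are plain integers, holders is a hypothetical map from an insert lock number to the position its holder has advertised (None if it has advertised nothing), an advertised position is taken to mean "everything this holder writes before that position is complete", and no actual blocking takes place.

```python
def must_wait_all_locks(holders):
    """Strategy 1: wait as long as any insert lock is held at all."""
    return len(holders) > 0


def must_wait_inserting_at(upto, holders):
    """Strategy 2: wait only for holders that may still be writing before `upto`.

    A holder that has advertised nothing (None) must be waited for,
    because nothing is known about how far it has got.
    """
    return any(pos is None or pos < upto for pos in holders.values())


# A committer needs WAL up to position 100 to be complete.
upto = 100

# Two backends hold insert locks but have advertised positions past 100,
# e.g. just before blocking on something slow.
busy_but_ahead = {0: 250, 1: 300}

# One backend is still inserting before position 100.
busy_behind = {0: 80}

# One backend holds a lock without having advertised a position.
busy_unknown = {0: None}
```

In the busy_but_ahead case the first strategy still makes the committer wait, while the second lets it proceed. That is the situation where a lock holder blocked in a slow call would otherwise stall every committer.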

Best regards,
Takashi

--
Takashi Menjo <takashi.menjo@gmail.com>

#69Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Takashi Menjo (#68)
Re: [PoC] Non-volatile WAL buffer

Hello Takashi-san,

On 3/5/21 9:08 AM, Takashi Menjo wrote:

Hi Tomas,

Thank you so much for your report. I have read it with great interest.

Your conclusion sounds reasonable to me. My patchset you call "NTT /
segments" got as good performance as "NTT / buffer" patchset. I have
been worried that calling mmap/munmap for each WAL segment file could
have a lot of overhead. Based on your performance tests, however, the
overhead looks less than what I thought. In addition, "NTT / segments"
patchset is more compatible to the current PG and more friendly to
DBAs because that patchset uses WAL segment files and does not
introduce any other new WAL-related file.

I agree. I was actually a bit surprised it performs this well, mostly in
line with the "NTT / buffer" patchset. I've seen significant issues with
our simple experimental patches, which however went away with larger WAL
segments. But the "NTT / segments" patch does not have that issue, so
either our patches were doing something wrong, or perhaps there was some
other issue (not sure why larger WAL segments would improve that).

Do these results match your benchmarks? Or are you seeing significantly
different behavior?

Do you have any thoughts regarding the impact of full-page writes? So
far all the benchmarks we did focused on small OLTP transactions on data
sets that fit into RAM. The assumption was that that's the workload that
would benefit from this, but maybe that's missing something important
about workloads producing much larger WAL records? Say, workloads
working with large BLOBs, bulk loads etc.

The other question is whether simply placing WAL on DAX (without any
code changes) is safe. If it's not, then all the "speedups" are computed
with respect to an unsafe configuration and so are useless. And BTT should
be used instead, which would of course produce very different results.

I also think that supporting both file I/O and mmap is better than
supporting only mmap. I will continue my work on "NTT / segments"
patchset to support both ways.

+1

In the following, I will answer "Issues & Questions" you reported.

While testing the "NTT / segments" patch, I repeatedly managed to crash the cluster with errors like this:

2021-02-28 00:07:21.221 PST client backend [3737139] WARNING: creating logfile segment just before
mapping; path "pg_wal/00000001000000070000002F"
2021-02-28 00:07:21.670 PST client backend [3737142] WARNING: creating logfile segment just before
mapping; path "pg_wal/000000010000000700000030"
...
2021-02-28 00:07:21.698 PST client backend [3737145] WARNING: creating logfile segment just before
mapping; path "pg_wal/000000010000000700000030"
2021-02-28 00:07:21.698 PST client backend [3737130] PANIC: could not open file
"pg_wal/000000010000000700000030": No such file or directory

I do believe this is a thinko in the 0008 patch, which does XLogFileInit in XLogFileMap. Notice there are multiple
"creating logfile" messages with the ..0030 segment, followed by the failure. AFAICS the XLogFileMap may be
called from multiple backends, so they may call XLogFileInit concurrently, likely triggering some sort of race
condition. It's fairly rare issue, though - I've only seen it twice from ~20 runs.

Thank you for your report. I found that rather the patch 0009 has an
issue, and that will also cause WAL loss. I should have set
use_existent to true, or InstallXlogFileSegment and BasicOpenFile in
XLogFileInit can be racy. I have misunderstood that use_existent can
be true because I am creating a brand-new file with XLogFileInit.

I will fix the issue.

OK, thanks for looking into this.

The other question I have is about WALInsertLockUpdateInsertingAt. 0003 removes this function, but leaves
behind some of the other bits working with insert locks and insertingAt. But it does not explain how it works without
WaitXLogInsertionsToFinish() - how does it ensure that when we commit something, all the preceding WAL is
"complete" (i.e. written by other backends etc.)?

To wait for *all* the WALInsertLocks to be released, no matter each of
them precedes or follows the current insertion.

It would have worked functionally, but I rethink it is not good for
performance because XLogFileMap in GetXLogBuffer (where
WaitXLogInsertionsToFinish is removed) can block because it can
eventually call write() in XLogFileInit.

I will restore the WALInsertLockUpdateInsertingAt function and related
code for mmap.

OK. I'm still not entirely sure I understand if the current version is
correct, but I'll wait for the reworked version.

kind regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#70Takashi Menjo
takashi.menjo@gmail.com
In reply to: Tomas Vondra (#69)
1 attachment(s)
Re: [PoC] Non-volatile WAL buffer

Hi Tomas,

Do these results match your benchmarks? Or are you seeing significantly
different behavior?

I ran a performance test for "NTT / segments" under the same conditions
and added its results to my previous report [1]. The updated graph
is attached to this mail. Note that some legends have been renamed:
"Mapped WAL file" to "NTT / simple", and "Non-volatile WAL buffer" to
"NTT / buffer."

The graph tells me that "NTT / segments" performs as well as "NTT /
buffer." This matches the results you reported.

Do you have any thoughts regarding the impact of full-page writes? So
far all the benchmarks we did focused on small OLTP transactions on data
sets that fit into RAM. The assumption was that that's the workload that
would benefit from this, but maybe that's missing something important
about workloads producing much larger WAL records? Say, workloads
working with large BLOBs, bulk loads etc.

I'd say that more work is needed for workloads producing a large
amount of WAL (in the number of records, the size per record, or
both). Based on the case Gang reported and I tried to reproduce in
this thread [2][3], the current inserting and flushing method can be
unsuitable for such workloads. The case was for "NTT / buffer," but
I think it also applies to "NTT / segments."

The other question is whether simply placing WAL on DAX (without any
code changes) is safe. If it's not, then all the "speedups" are computed
with respect to unsafe configuration and so are useless. And BTT should
be used instead, which would of course produce very different results.

I think it's safe, thanks to the checksum in the header of each WAL
record (xl_crc in struct XLogRecord). In DAX mode, user data (a WAL
record here) is written to the PMEM device in a smaller unit (probably
a byte or a cache line) than the traditional 512-byte disk sector. So a
torn write such that "some bytes in a sector persist, other bytes do
not" can occur on a crash. AFAICS, however, the checksum for WAL records
also covers such a torn-write case.
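
The torn-write argument can be sketched in a few lines of Python. This is an illustration only, with several assumptions: the record layout (a length and a CRC followed by the payload) is made up and is not struct XLogRecord, zlib.crc32 stands in for the CRC-32C that PostgreSQL uses for WAL, the persistence granularity is taken to be a 64-byte cache line, and bytes that did not persist are assumed to read back as zeros.

```python
import struct
import zlib

CACHE_LINE = 64                # assumed persistence granularity for this sketch
HEADER = struct.Struct("<II")  # (payload length, CRC); made-up layout


def make_record(payload):
    """Prefix the payload with its length and a CRC-32 of the payload."""
    return HEADER.pack(len(payload), zlib.crc32(payload)) + payload


def is_valid(record):
    """Recompute the CRC and compare it with the one stored in the header."""
    length, stored_crc = HEADER.unpack_from(record)
    payload = record[HEADER.size:HEADER.size + length]
    return len(payload) == length and zlib.crc32(payload) == stored_crc


def tear(record, persisted_bytes):
    """Simulate a crash where only the first `persisted_bytes` reached the media."""
    return record[:persisted_bytes] + b"\0" * (len(record) - persisted_bytes)


payload = b"pgbench-update:" * 4  # 60 bytes, so the whole record is 68 bytes
record = make_record(payload)
torn = tear(record, CACHE_LINE)   # the last 4 payload bytes never persisted
```

Here is_valid(record) holds and is_valid(torn) does not, so a reader replaying the log would stop at the torn record. Whether that is sufficient for every DAX failure mode is the question under discussion and is not settled by this sketch.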

Best regards,
Takashi

[1]: /messages/by-id/CAOwnP3OFofOsFtmeikQcbMp0YWdJn0kVB4Ka_0tj+Urq7dtAzQ@mail.gmail.com
[2]: /messages/by-id/BYAPR11MB344801FF81E9C92A081D3E10E6080@BYAPR11MB3448.namprd11.prod.outlook.com
[3]: /messages/by-id/CAOwnP3NHAbVFOfAawZPs5ezn57_7fcX=KaaQ5YMgirc9rNrijQ@mail.gmail.com

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

pgbench-optane-pmem-msync-9e7dbe3-s50.png (image/png)
#71tsunakawa.takay@fujitsu.com
tsunakawa.takay@fujitsu.com
In reply to: Takashi Menjo (#70)
RE: [PoC] Non-volatile WAL buffer

From: Takashi Menjo <takashi.menjo@gmail.com>

The other question is whether simply placing WAL on DAX (without any
code changes) is safe. If it's not, then all the "speedups" are
computed with respect to unsafe configuration and so are useless. And
BTT should be used instead, which would of course produce very different
results.

I think it's safe, thanks to the checksum in the header of WAL record (xl_crc in
struct XLogRecord). In DAX mode, user data (WAL record
here) is written to the PMEM device by a smaller unit (probably a byte or a
cache line) than the traditional 512-byte disk sector. So a torn-write such that
"some bytes in a sector persist, other bytes not"
can occur when crash. AFAICS, however, the checksum for WAL records can
also support such a torn-write case.

I'm afraid I may be misunderstanding, so let me ask a naive question.

I understood "simply placing WAL on DAX (without any code changes)" to mean placing WAL files on a DAX-aware filesystem such as ext4 or xfs, without modifying the Postgres source code. That is, using the PMEM as a high-performance storage device. Is this correct?

Second, is that what you presented as "master" in your test results?

I'd simply like to know what percentage of performance improvement we can expect from utilizing PMDK and modifying the Postgres source code, and how much improvement we consider worthwhile.

Regards
Takayuki Tsunakawa