[HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Started by Yoshimi Ichiyanagi, almost 8 years ago, 39 messages
#1 Yoshimi Ichiyanagi
ichiyanagi.yoshimi@lab.ntt.co.jp
3 attachment(s)

Hi.

These patches enable the use of the Persistent Memory Development Kit (PMDK)[1]
for reading and writing WAL on persistent memory (PMEM).
PMEM is next-generation storage with a number of nice features:
it is fast, byte-addressable and non-volatile.

With pgbench, the general-purpose PostgreSQL benchmark, a postgres server
with the patches applied is about 5% faster than the original server.
With my insert benchmark, it is up to 90% faster than the original one.
I describe the details below.

This e-mail describes the following:
A) About PMDK
B) About the patches
C) The way of running benchmarks using the patches, and the results

A) About PMDK
PMDK provides functions that let an application access PMEM directly
as memory, without going through the kernel, so that the application
gets high-speed access to PMEM.
The following APIs are available through PMDK.
A-1. APIs to open a file on PMEM, to create a file on PMEM,
and to map a file on PMEM to virtual addresses
A-2. APIs to read/write data from and to a file on PMEM

A-1. APIs to open a file on PMEM, to create a file on PMEM,
and to map a file on PMEM to virtual addresses

PMDK provides these APIs using the DAX filesystem (DAX FS)[2] feature.

DAX FS is a PMEM-aware file system which allows direct access
to PMEM without using the kernel page cache. A file in DAX FS can
be mapped to memory using standard calls like mmap() on Linux.
Furthermore, once the file on PMEM is mapped to virtual addresses (and
after any initial minor page faults that may be required to create
the mappings in the MMU), applications can access PMEM
using CPU load/store instructions instead of read/write system calls.
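
As a rough illustration of the load/store access described above, here is a minimal sketch using only POSIX mmap(), not PMDK. The helper names mapped_write() and mapped_read() are hypothetical and are not part of the patches. On a DAX-mounted file system the same calls would reach PMEM directly; on an ordinary file system they go through the page cache, which is why the sketch still calls msync().

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create/resize the file at "path" to "len" bytes,
 * map it, and copy "src" into it with ordinary CPU stores.
 * Returns 0 on success, -1 on failure.
 */
int
mapped_write(const char *path, const void *src, size_t len)
{
	int		fd;
	void   *addr;
	int		rc;

	assert(len > 0);			/* mmap() rejects zero-length mappings */

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) len) != 0)
	{
		close(fd);
		return -1;
	}
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close() */
	if (addr == MAP_FAILED)
		return -1;

	memcpy(addr, src, len);		/* plain stores, no write() system call */
	rc = msync(addr, len, MS_SYNC);	/* flush to the backing file */
	if (munmap(addr, len) != 0)
		rc = -1;
	return rc;
}

/*
 * Hypothetical helper: map the file read-only and copy "len" bytes out
 * of it with ordinary CPU loads. Returns 0 on success, -1 on failure.
 */
int
mapped_read(const char *path, void *dst, size_t len)
{
	int		fd;
	void   *addr;
	struct stat st;

	assert(len > 0);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	/* Touching pages beyond end-of-file would raise SIGBUS, so check. */
	if (fstat(fd, &st) != 0 || (size_t) st.st_size < len)
	{
		close(fd);
		return -1;
	}
	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	close(fd);
	if (addr == MAP_FAILED)
		return -1;

	memcpy(dst, addr, len);		/* plain loads, no read() system call */
	return munmap(addr, len);
}
```

A caller would use these the same way it would use write() and read() on a whole file, for example mapped_write(path, buf, n) followed later by mapped_read(path, buf, n).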

A-2. APIs to read/write data from and to a file on PMEM

PMDK provides memcpy()-like APIs that copy data to PMEM
using single instruction, multiple data (SIMD) instructions[3] and
NT store instructions[4]. These instructions improve the performance
of copying data to PMEM. As a result, using these APIs is faster than
using read/write system calls.

[1]: http://pmem.io/pmdk/
[2]: https://www.usenix.org/system/files/login/articles/login_summer17_07_rudoff.pdf
[3]: SIMD: a SIMD instruction operates on all loaded data in a single
operation. If the SIMD system loads eight data items into registers at once,
the store operation to PMEM happens for all eight values
at the same time.
[4]: NT store instructions: NT store instructions bypass the CPU cache,
so using these instructions does not require a flush.
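
To make footnotes [3] and [4] more concrete, here is a minimal sketch of a copy loop built on SSE2 non-temporal stores. It is not PMDK code and not part of the patches; nt_copy() is a hypothetical name, and the alignment and length restrictions are simplifications for illustration. In libpmem the corresponding entry points are pmem_memcpy_nodrain() and pmem_drain(); as far as I understand, the store fence at the end of the sketch plays the role of the drain step. On builds without SSE2 the sketch falls back to plain memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper: copy "len" bytes from "src" to "dst" using
 * 16-byte non-temporal (cache-bypassing) stores where available.
 * Simplifying assumptions: "dst" is 16-byte aligned and "len" is a
 * multiple of 16.
 */
void
nt_copy(void *dst, const void *src, size_t len)
{
	assert(((uintptr_t) dst % 16) == 0);
	assert(len % 16 == 0);

#if defined(__SSE2__)
	{
		__m128i    *d = (__m128i *) dst;
		const __m128i *s = (const __m128i *) src;
		size_t		i;

		/* Each iteration loads 16 bytes and stores them past the cache. */
		for (i = 0; i < len / 16; i++)
			_mm_stream_si128(&d[i], _mm_loadu_si128(&s[i]));

		/* Order the NT stores before any later stores become visible. */
		_mm_sfence();
	}
#else
	/* No SSE2: ordinary cached copy, shown only so the sketch compiles. */
	memcpy(dst, src, len);
#endif
}
```

Because the stores bypass the cache, no per-cache-line flush is needed afterwards; only the final fence remains.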

B) About the patches
Changes by the patches:
0001-Add-configure-option-for-PMDK.patch:
- Added "--with-libpmem" configure option to execute I/O with PMDK library

0002-Read-write-WAL-files-using-PMDK.patch:
- Added PMDK implementation for WAL I/O operations
- Added "pmem_drain" to the wal_sync_method parameter list
to write logs synchronously on PMEM

0003-Walreceiver-WAL-IO-using-PMDK.patch:
- Added PMDK implementation for the walreceiver process on secondary servers

C) The way of running benchmarks using the patches, and the results
The setup, procedure and results are as follows:

Experimental setup
Server: HP ProLiant DL360 Gen9
CPU: Xeon E5-2667 v4 (3.20GHz); 2 processors(without HT)
DRAM: DDR4-2400; 32 GiB/processor
(8GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor
(8GiB/socket x 4 sockets/processor) x 2 processors
HDD: Seagate Constellation2 2.5inch SATA 3.0. 6Gb/s 1TB 7200rpm x 1
OS: Ubuntu 16.04, linux-4.12
DAX FS: ext4
NVML: master@Aug 30, 2017
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node,
and the benchmarks to the other NUMA node.

C-1. Configuring PMEM for use as a block device
# ndctl list
# ndctl create-namespace -f -e namespace0.0 --mode=memory -M dev

C-2. Creating a file system on PMEM, and mounting it with DAX
# mkfs.ext4 /dev/pmem0
# mount -t ext4 -o dax /dev/pmem0 /mnt/pmem0

C-3. Setting PMEM_IS_PMEM_FORCE to determine whether the WAL files are stored
on PMEM
Note: If this environment variable is not set, the postgres processes do
not start.
# export PMEM_IS_PMEM_FORCE=1

C-4. Installing PostgreSQL
Note: There are 3 important things when installing PostgreSQL.
a. Executing "./configure --with-libpmem" to link libpmem
b. Setting WAL directory on PMEM
c. Modifying wal_sync_method parameter in postgresql.conf from fdatasync
to pmem_drain

# cd /path/to/[PG_source dir]
# ./configure --with-libpmem
# make && make install
# initdb /path/to/PG_DATA -X /mnt/pmem0/path/to/[PG_WAL dir]
# cat /path/to/PG_DATA/postgresql.conf | sed -e s/#wal_sync_method\ =\ fsync/wal_sync_method\ =\ pmem_drain/ > /path/to/PG_DATA/postgresql.conf.tmp
# mv /path/to/PG_DATA/postgresql.conf.tmp /path/to/PG_DATA/postgresql.conf
# pg_ctl start -D /path/to/PG_DATA
# createdb [DB_NAME]

C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
C-5-1. pgbench
# numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync: tps = 43,179
wal_sync_method=pmem_drain: tps = 45,254

C-5-2. pclient_thread: my insert benchmark
Preparation
CREATE TABLE [TABLE_NAME] (id int8, value text);
ALTER TABLE [TABLE_NAME] ALTER value SET STORAGE external;
PREPARE insert_sql (int8) AS INSERT INTO %s (id, value) values ($1, '[1K_data]');

Execution
BEGIN; EXECUTE insert_sql(%lld); COMMIT;
Note: I ran this query 5M times with 32 threads.

# ./pclient_thread
Invalid Arguments:
Usage: ./pclient_thread [The number of threads] [The number to insert tuples] [data size(KB)]
# numactl -N 1 ./pclient_thread 32 5242880 1

The averages of running this benchmark three times are:
wal_sync_method=fdatasync: tps = 67,780
wal_sync_method=pmem_drain: tps = 131,962
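
For readers who want to turn the averages above into relative speedups, here is a small sketch of the arithmetic. The function name speedup_percent() is hypothetical. Applied to the averaged figures, it gives roughly 4.8% for pgbench and roughly 94.7% for the insert benchmark.

```c
#include <assert.h>

/*
 * Hypothetical helper: relative throughput gain of the patched server
 * over the baseline, in percent.
 */
double
speedup_percent(double baseline_tps, double patched_tps)
{
	assert(baseline_tps > 0.0);
	return (patched_tps - baseline_tps) / baseline_tps * 100.0;
}
```

For example, speedup_percent(43179, 45254) covers the pgbench run and speedup_percent(67780, 131962) covers the insert benchmark.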

--
Yoshimi Ichiyanagi

Attachments:

0001-Add-configure-option-for-PMDK.patch (application/octet-stream)
diff --git a/configure b/configure
index 45221e1..0ebf1d4 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libpmem
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -843,6 +844,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libpmem
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1538,6 +1540,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libpmem          use PMEM support for WAL I/O
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -6346,6 +6349,33 @@ fi
 
 
 
+#
+# PMEM
+#
+
+
+
+# Check whether --with-libpmem was given.
+if test "${with_libpmem+set}" = set; then :
+  withval=$with_libpmem;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBPMEM 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libpmem=no
+
+fi
 
 
 
@@ -10320,6 +10350,57 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'pmem' is required for PMEM support" "$LINENO" 5
+fi
+
+fi
+
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -11054,6 +11135,17 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for PMEM support" "$LINENO" 5
+fi
+
+
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      for ac_header in ldap.h
diff --git a/configure.in b/configure.in
index 4d26034..a959b28 100644
--- a/configure.in
+++ b/configure.in
@@ -812,6 +812,14 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 AC_SUBST(with_libxslt)
 
 #
+# PMEM
+#
+PGAC_ARG_BOOL(with, libpmem, no, [use PMEM support for WAL I/O],
+	      [AC_DEFINE([USE_LIBPMEM], 1, [Define to 1 to use PMEM support for WAL I/O. (--with-libpmem)])])
+
+AC_SUBST(with_libpmem)
+
+#
 # tzdata
 #
 PGAC_ARG_REQ(with, system-tzdata,
@@ -1089,6 +1097,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [], [AC_MSG_ERROR([library 'pmem' is required for PMEM support])])
+fi
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -1247,6 +1259,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for PMEM support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f98f773..7867118 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -346,6 +346,9 @@
 /* Define to 1 if you have the `xslt' library (-lxslt). */
 #undef HAVE_LIBXSLT
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
@@ -847,6 +850,9 @@
    (--with-libxslt) */
 #undef USE_LIBXSLT
 
+/* Define to 1 to use PMEM support for WAL I/O. (--with-libpmem) */
+#undef USE_LIBPMEM
+
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
0002-Read-write-WAL-files-using-PMDK.patch (application/octet-stream)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 61d3605..fd4d232 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -426,7 +426,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -505,7 +505,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e42b828..60890b5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -58,6 +58,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -146,6 +147,9 @@ const struct config_enum_entry sync_method_options[] = {
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
 #endif
+#ifdef USE_LIBPMEM
+	{"pmem_drain", SYNC_METHOD_PMEM_DRAIN, false},
+#endif
 	{NULL, 0, false}
 };
 
@@ -774,6 +778,7 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
 static uint32 openLogOff = 0;
+static void	*mappedLogFileAddr = NULL;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -788,6 +793,7 @@ static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static XLogSource readSource = 0;	/* XLOG_FROM_* code */
+static void	*mappedReadFileAddr = NULL;
 
 /*
  * Keeps track of which source we're currently reading from. This is
@@ -866,13 +872,15 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
+static int do_XLogFileOpen(char *pathname, int flags, void **addr);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock);
+					   bool use_lock, bool fsync_file);
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk);
-static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
+			 int source, bool notfoundOk, void **addr);
+static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source,
+		void **addr);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
 			 TimeLineID *readTLI);
@@ -2321,6 +2329,15 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }
 
+static int
+do_XLogFileOpen(char *pathname, int flags,  void **addr)
+{
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		return PmemFileOpen(pathname, flags, wal_segment_size, addr);
+	else
+		return BasicOpenFile(pathname, flags);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2400,23 +2417,25 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent,
+					true, &mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile < 0 && mappedLogFileAddr == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
+			openLogFile = XLogFileOpen(openLogSegNo,
+					&mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
@@ -2453,12 +2472,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			/* Need to seek in the file? */
 			if (openLogOff != startoffset)
 			{
-				if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not seek in log file %s to offset %u: %m",
-									XLogFileNameP(ThisTimeLineID, openLogSegNo),
-									startoffset)));
+				if (mappedLogFileAddr == NULL)
+					if (lseek(openLogFile, (off_t) startoffset, SEEK_SET) < 0)
+						ereport(PANIC,
+								(errcode_for_file_access(),
+								 errmsg("could not seek in log file %s to offset %u: %m",
+									 XLogFileNameP(ThisTimeLineID, openLogSegNo),
+									 startoffset)));
 				openLogOff = startoffset;
 			}
 
@@ -2469,6 +2489,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+				if (mappedLogFileAddr != NULL)
+				{
+					pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+					PmemFileWrite((char *)mappedLogFileAddr+openLogOff, from, nleft);
+					pgstat_report_wait_end();
+					break;
+				}
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = write(openLogFile, from, nleft);
 				pgstat_report_wait_end();
@@ -2565,15 +2592,16 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if ((openLogFile >= 0 || mappedLogFileAddr != NULL) &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile < 0 && mappedLogFileAddr == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
+				openLogFile = XLogFileOpen(openLogSegNo,
+						&mappedLogFileAddr);
 				openLogOff = 0;
 			}
 
@@ -2986,7 +3014,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3157,7 +3185,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+		void **addr)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3167,6 +3196,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 	int			fd;
 	int			nbytes;
+	void	*tmpaddr = NULL;
+	bool	fsync_file = true;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3175,16 +3206,20 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		fd = do_XLogFileOpen(path,
+				O_RDWR | PG_BINARY | get_sync_bit(sync_method),
+				&tmpaddr);
+		if (fd < 0 && tmpaddr == NULL)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
-		else
+		else {
+			*addr = tmpaddr;
 			return fd;
+		}
 	}
 
 	/*
@@ -3200,8 +3235,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	fd = do_XLogFileOpen(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+			&tmpaddr);
+	if (fd < 0 && tmpaddr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3222,40 +3258,53 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer, 0, XLOG_BLCKSZ);
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 	{
-		errno = 0;
-		pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-		if ((int) write(fd, zbuffer, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+		if (tmpaddr != NULL)
 		{
-			int			save_errno = errno;
+			fsync_file = false;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			PmemFileWrite((char *)tmpaddr+nbytes, zbuffer,
+					XLOG_BLCKSZ);
+		}
+		else
+		{
+			errno = 0;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			if ((int) write(fd, zbuffer, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+			{
+				int			save_errno = errno;
 
-			/*
-			 * If we fail to make the file, delete it to release disk space
-			 */
-			unlink(tmppath);
+				/*
+				 * If we fail to make the file, delete it
+				 * to release disk space
+				 */
+				unlink(tmppath);
 
-			close(fd);
+				close(fd);
 
-			/* if write didn't set errno, assume problem is no disk space */
-			errno = save_errno ? save_errno : ENOSPC;
+				/* if write didn't set errno, assume problem is
+				 * no disk space */
+				errno = save_errno ? save_errno : ENOSPC;
 
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write to file \"%s\": %m",
+							 tmppath)));
+			}
 		}
 		pgstat_report_wait_end();
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, tmpaddr) != 0)
 	{
-		close(fd);
+		do_XLogFileClose(fd, tmpaddr);
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd))
+	if (do_XLogFileClose(fd, tmpaddr))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3282,7 +3331,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 								*use_existent, max_segno,
-								use_lock))
+								use_lock,
+								fsync_file))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3296,8 +3346,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+			O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+
+	if (fd < 0 && *addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3332,13 +3384,21 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	int			srcfd;
 	int			fd;
 	int			nbytes;
+	void		*src_addr = NULL, *dst_addr = NULL;
+	bool		fsync_file = true;
 
 	/*
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno, wal_segment_size);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		srcfd = MapTransientFile(path, O_RDONLY | PG_BINARY,
+				wal_segment_size, &src_addr);
+
+	if (src_addr == NULL)
+		srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+
+	if (srcfd < 0 && src_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3351,15 +3411,32 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	if ( src_addr != NULL && sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fd = MapTransientFile(tmppath,
+				O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+				wal_segment_size, &dst_addr);
+	else
+		fd = OpenTransientFile(tmppath,
+				O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	if (fd < 0 && dst_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m", tmppath)));
+				 errmsg("could not create file \"%s\": %m",
+					 tmppath)));
 
 	/*
 	 * Do the data copying.
 	 */
+	if (src_addr && dst_addr) {
+		pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+		PmemFileWrite(dst_addr, src_addr, wal_segment_size);
+		pgstat_report_wait_end();
+		fsync_file = false;
+
+		goto done_copy;
+	}
+
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += sizeof(buffer))
 	{
 		int			nread;
@@ -3408,29 +3485,42 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+					 errmsg("could not write to file \"%s\": %m",
+						 tmppath)));
 		}
 		pgstat_report_wait_end();
 	}
 
+done_copy:
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, dst_addr) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd))
+	if (dst_addr)
+	{
+		if (UnmapTransientFile(dst_addr, wal_segment_size))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not unmap file \"%s\": %m",
+						 tmppath)));
+	}
+	else if (CloseTransientFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
-	CloseTransientFile(srcfd);
+	if (src_addr)
+		UnmapTransientFile(src_addr, wal_segment_size);
+	else
+		CloseTransientFile(srcfd);
 
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false, fsync_file))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3465,7 +3555,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool use_lock, bool fsync_file)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3504,7 +3594,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	if (durable_link_or_rename(tmppath, path, LOG) != 0)
+	if (durable_link_or_rename(tmppath, path, LOG, fsync_file) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3522,15 +3612,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(XLogSegNo segno)
+XLogFileOpen(XLogSegNo segno, void **addr)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+			O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open write-ahead log file \"%s\": %m", path)));
@@ -3546,7 +3637,7 @@ XLogFileOpen(XLogSegNo segno)
  */
 static int
 XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk)
+			 int source, bool notfoundOk, void **addr)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
@@ -3595,8 +3686,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 	}
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-	if (fd >= 0)
+	fd = do_XLogFileOpen(path, O_RDONLY | PG_BINARY, addr);
+	if (fd >= 0 || *addr != NULL)
 	{
 		/* Success! */
 		curFileTLI = tli;
@@ -3628,7 +3719,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLEs.
  */
 static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source, void **addr)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -3668,8 +3759,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_ARCHIVE, true);
-			if (fd != -1)
+					XLOG_FROM_ARCHIVE, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
@@ -3681,8 +3772,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_PG_WAL, true);
-			if (fd != -1)
+					XLOG_FROM_PG_WAL, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3700,13 +3791,22 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 	return -1;
 }
 
+int
+do_XLogFileClose(int fd, void *addr)
+{
+	if (!addr)
+		return close(fd);
+
+	return PmemFileClose(addr, wal_segment_size);
+}
+
 /*
  * Close the current logfile segment for writing.
  */
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile >= 0 || mappedLogFileAddr != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3715,15 +3815,16 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && openLogFile > 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close log file %s: %m",
 						XLogFileNameP(ThisTimeLineID, openLogSegNo))));
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 }
 
@@ -3743,6 +3844,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
+	void		*laddr = NULL;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3751,8 +3853,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		lf = XLogFileInit(_logSegNo, &use_existent, true, &laddr);
+		do_XLogFileClose(lf, laddr);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -3972,6 +4074,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	struct stat statbuf;
 	XLogSegNo	endlogSegNo;
 	XLogSegNo	recycleSegNo;
+	bool		fsync_file = true;
 
 	/*
 	 * Initialize info about where to try to recycle to.
@@ -3984,6 +4087,9 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 
 	snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname);
 
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fsync_file = false;
+
 	/*
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
@@ -3992,7 +4098,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr PriorRedoPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-							   true, recycleSegNo, true))
+			true, recycleSegNo, true, fsync_file))
 	{
 		ereport(DEBUG2,
 				(errmsg("recycled write-ahead log file \"%s\"",
@@ -4162,9 +4268,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			if (readFile >= 0)
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
 			{
-				close(readFile);
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+				mappedReadFileAddr = NULL;
 				readFile = -1;
 			}
 
@@ -4682,7 +4789,8 @@ UpdateControlFile(void)
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
 	if (write(fd, ControlFile, sizeof(ControlFileData)) != sizeof(ControlFileData))
 	{
-		/* if write didn't set errno, assume problem is no disk space */
+		/* if write didn't set errno, assume problem is no disk
+		 * space */
 		if (errno == 0)
 			errno = ENOSPC;
 		ereport(PANIC,
@@ -5111,34 +5219,44 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false, &mappedLogFileAddr);
 
 	/* Write the first page with the initial record */
 	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+
+	if (mappedLogFileAddr != NULL)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		PmemFileWrite(mappedLogFileAddr, page, XLOG_BLCKSZ);
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
 	}
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	if (xlog_fsync(openLogFile, (void *)mappedLogFileAddr) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 
 	/* Now create pg_control */
@@ -5554,9 +5672,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * If the ending log segment is still open, close it (to avoid problems on
 	 * Windows with trying to rename or delete an open file).
 	 */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 
@@ -5595,10 +5714,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 */
 		bool		use_existent = true;
 		int			fd;
+		void		*tmpaddr = NULL;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent, true, &tmpaddr);
 
-		if (close(fd))
+		if (do_XLogFileClose(fd, tmpaddr))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not close log file %s: %m",
@@ -7751,9 +7871,10 @@ StartupXLOG(void)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 	XLogReaderFree(xlogreader);
@@ -10062,6 +10183,9 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+#endif
 			return 0;
 #ifdef OPEN_SYNC_FLAG
 		case SYNC_METHOD_OPEN:
@@ -10079,7 +10203,36 @@ get_sync_bit(int method)
 }
 
 /*
- * GUC support
+ * GUC check_hook for xlog_sync_method
+ */
+bool
+check_xlog_sync_method(int *newval, void **extra, GucSource source)
+{
+	bool ret;
+	char tmppath[MAXPGPATH] = {};
+	int val = newval ? *newval : sync_method;
+
+	if (val != SYNC_METHOD_PMEM_DRAIN)
+		return true;
+
+	snprintf(tmppath, MAXPGPATH, "%s/" XLOGDIR "/pmem.tmp", DataDir);
+
+	ret = CheckPmem(tmppath);
+
+	if (!ret)
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for parameter \"wal_sync_method\": \"pmem_drain\"");
+		GUC_check_errdetail("%s isn't stored on persistent memory (pmem_is_pmem() returned false).",
+				XLOGDIR);
+		GUC_check_errhint("Please see also ENVIRONMENT VARIABLES section in man libpmem.");
+	}
+
+	return ret;
+}
+
+/*
+ * GUC assign_hook for xlog_sync_method
  */
 void
 assign_xlog_sync_method(int new_sync_method, void *extra)
@@ -10092,10 +10245,10 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			if (xlog_fsync(openLogFile, (void *)mappedLogFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not fsync log segment %s: %m",
@@ -10144,6 +10297,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 								XLogFileNameP(ThisTimeLineID, segno))));
 			break;
 #endif
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+			PmemFileSync();
+			break;
+#endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
 			/* write synced it already */
@@ -10154,6 +10312,17 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	}
 }
 
+int
+xlog_fsync(int fd, void *addr)
+{
+	if (!addr)
+		return pg_fsync(fd);
+
+	PmemFileSync();
+	return 0;
+}
+
+
 /*
  * Return the filename of given log segment, as a palloc'd string.
  */
@@ -11565,7 +11734,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || mappedReadFileAddr != NULL) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -11582,7 +11751,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 		readSource = 0;
 	}
@@ -11591,7 +11761,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readFile < 0 && mappedReadFileAddr == NULL) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
@@ -11600,8 +11770,9 @@ retry:
 										 private->fetching_ckpt,
 										 targetRecPtr))
 		{
-			if (readFile >= 0)
-				close(readFile);
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+			mappedReadFileAddr = NULL;
 			readFile = -1;
 			readLen = 0;
 			readSource = 0;
@@ -11614,7 +11785,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || mappedReadFileAddr != NULL);
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -11635,30 +11806,44 @@ retry:
 
 	/* Read the requested page */
 	readOff = targetPageOff;
-	if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
-	{
-		char		fname[MAXFNAMELEN];
+	if (mappedReadFileAddr == NULL) {
+		if (lseek(readFile, (off_t) readOff, SEEK_SET) < 0)
+		{
+			char		fname[MAXFNAMELEN];
 
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-				(errcode_for_file_access(),
-				 errmsg("could not seek in log segment %s to offset %u: %m",
-						fname, readOff)));
-		goto next_record_is_invalid;
+			XLogFileName(fname, curFileTLI, readSegNo,
+					wal_segment_size);
+			ereport(emode_for_corrupt_record(emode,
+						targetPagePtr + reqLen),
+					(errcode_for_file_access(),
+					 errmsg("could not seek in log segment %s to offset %u: %m",
+						 fname, readOff)));
+			goto next_record_is_invalid;
+		}
 	}
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
-	{
-		char		fname[MAXFNAMELEN];
+	if (mappedReadFileAddr) {
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		PmemFileRead((char *)mappedReadFileAddr+readOff, readBuf,
+				XLOG_BLCKSZ);
 
-		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-				(errcode_for_file_access(),
-				 errmsg("could not read from log segment %s, offset %u: %m",
-						fname, readOff)));
-		goto next_record_is_invalid;
+	}
+	else {
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		if (read(readFile, readBuf, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			char		fname[MAXFNAMELEN];
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo,
+					wal_segment_size);
+			ereport(emode_for_corrupt_record(emode,
+						targetPagePtr + reqLen),
+					(errcode_for_file_access(),
+					 errmsg("could not read from log segment %s, offset %u: %m",
+						 fname, readOff)));
+			goto next_record_is_invalid;
+		}
 	}
 	pgstat_report_wait_end();
 
@@ -11672,8 +11857,9 @@ retry:
 next_record_is_invalid:
 	lastSourceFailed = true;
 
-	if (readFile >= 0)
-		close(readFile);
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+	mappedReadFileAddr = NULL;
 	readFile = -1;
 	readLen = 0;
 	readSource = 0;
@@ -11922,9 +12108,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 				/* Close any old file we might have open. */
-				if (readFile >= 0)
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 				{
-					close(readFile);
+					do_XLogFileClose(readFile,
+							mappedReadFileAddr);
+					mappedReadFileAddr = NULL;
 					readFile = -1;
 				}
 				/* Reset curFileTLI if random fetch. */
@@ -11937,8 +12125,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 */
 				readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
-											  currentSource);
-				if (readFile >= 0)
+											  currentSource, &mappedReadFileAddr);
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 					return true;	/* success! */
 
 				/*
@@ -12002,14 +12190,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (readFile < 0 && mappedReadFileAddr == NULL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
 													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+													XLOG_FROM_STREAM, false, &mappedReadFileAddr);
+							Assert(readFile >= 0 || mappedReadFileAddr != NULL);
 						}
 						else
 						{
diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index ca6a0e4..9271153 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/file
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o
+OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o pmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 71516a9..ec19e37 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -88,6 +88,7 @@
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -125,12 +126,6 @@
 #define FD_MINFREE				10
 
 /*
- * Default mode for created files, unless something else is specified using
- * the *Perm() function variants.
- */
-#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
-
-/*
  * A number of platforms allow individual processes to open many more files
  * than they can really support when *many* processes do the same thing.
  * This GUC parameter lets the DBA limit max_safe_fds to something less than
@@ -237,6 +232,9 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
+#ifdef USE_LIBPMEM
+	AllocateDescMap,
+#endif
 	AllocateDescPipe,
 	AllocateDescDir,
 	AllocateDescRawFD
@@ -251,6 +249,10 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+#ifdef USE_LIBPMEM
+		size_t	fsize;
+		void	   *addr;
+#endif
 	}			desc;
 } AllocateDesc;
 
@@ -724,14 +726,16 @@ durable_unlink(const char *fname, int elevel)
  * valid upon return.
  */
 int
-durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
+durable_link_or_rename(const char *oldfile, const char *newfile, int elevel,
+		bool fsync_file)
 {
 	/*
 	 * Ensure that, if we crash directly after the rename/link, a file with
 	 * valid contents is moved into place.
 	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+			return -1;
 
 #if HAVE_WORKING_LINK
 	if (link(oldfile, newfile) < 0)
@@ -759,8 +763,9 @@ durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
 	 * Make change persistent in case of an OS crash, both the new entry and
 	 * its parent directory need to be flushed.
 	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+			return -1;
 
 	/* Same for parent directory */
 	if (fsync_parent_path(newfile, elevel) != 0)
@@ -1618,6 +1623,76 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	return file;
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Mmap a file with MapTransientFilePerm() and pass default file mode for
+ * the fileMode parameter.
+ */
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return MapTransientFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT,
+			fsize, addr);
+}
+
+/*
+ * Like AllocateFile, but returns an unbuffered pointer to the mapped area
+ * like mmap(2)
+ */
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+		size_t fsize, void **addr)
+{
+	int			fd;
+
+	DO_DB(elog(LOG, "MapTransientFilePerm: Allocated %d (%s)",
+			   numAllocatedDescs, fileName));
+
+	/* Can we allocate another non-virtual FD? */
+	if (!reserveAllocatedDesc())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
+						maxAllocatedDescs, fileName)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	if (addr != NULL)
+	{
+		void *ret_addr = NULL;
+		fd = PmemFileOpenPerm(fileName, fileFlags, fileMode, fsize, &ret_addr);
+		if (ret_addr != NULL)
+		{
+			AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+			*addr = ret_addr;
+
+			desc->kind = AllocateDescMap;
+			desc->desc.addr = ret_addr;
+			desc->desc.fsize = fsize;
+			desc->create_subid = GetCurrentSubTransactionId();
+			numAllocatedDescs++;
+
+			return fd;
+		}
+	}
+
+	return -1;					/* failure */
+}
+#else
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return -1;
+}
+
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+		size_t fsize, void **addr)
+{
+	return -1;
+}
+#endif
 
 /*
  * Create a new file.  The directory containing it must already exist.  Files
@@ -2512,6 +2587,11 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescRawFD:
 			result = close(desc->desc.fd);
 			break;
+#ifdef USE_LIBPMEM
+		case AllocateDescMap:
+			result = PmemFileClose(desc->desc.addr, desc->desc.fsize);
+			break;
+#endif
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -2553,6 +2633,42 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Unmap a file returned by MapTransientFile.
+ *
+ * Note we do not check unmap's return value --- it is up to the caller
+ * to handle unmap errors.
+ */
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "UnmapTransientFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove the mapping from the list of allocated descriptors, if it's present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescMap && desc->desc.addr == addr)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us an address not in allocatedDescs */
+	elog(WARNING, "address passed to UnmapTransientFile was not obtained from MapTransientFile");
+
+	return PmemFileClose(addr, fsize);
+}
+#else
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	return -1;
+}
+#endif
+
 /*
  * Close a file returned by OpenTransientFile.
  *
diff --git a/src/backend/storage/file/pmem.c b/src/backend/storage/file/pmem.c
new file mode 100644
index 0000000..85fed32
--- /dev/null
+++ b/src/backend/storage/file/pmem.c
@@ -0,0 +1,188 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.c
+ *	  Persistent memory (PMEM) mapped-file access code.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/file/pmem.c
+ *
+ * NOTES:
+ *
+ * This code manages a memory-mapped file on a filesystem mounted with DAX on
+ * a persistent memory device using the Persistent Memory Development Kit
+ * (http://pmem.io/pmdk/).
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/pmem.h"
+#include "storage/fd.h"
+
+#ifdef USE_LIBPMEM
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libpmem.h>
+#include <sys/mman.h>
+#include <string.h>
+
+#define PmemFileSize 32
+
+/*
+ * This function returns true, only if the file is stored on persistent memory.
+ */
+bool
+CheckPmem(const char *path)
+{
+	int    is_pmem = 0; /* false */
+	size_t mapped_len = 0;
+	bool   ret = true;
+	void   *tmpaddr;
+
+	/*
+	 * The value of is_pmem is 0, if the file(path) isn't stored on
+	 * persistent memory.
+	 */
+	tmpaddr = pmem_map_file(path, PmemFileSize, PMEM_FILE_CREATE,
+			PG_FILE_MODE_DEFAULT, &mapped_len, &is_pmem);
+
+	if (tmpaddr)
+	{
+		pmem_unmap(tmpaddr, mapped_len);
+		unlink(path);
+	}
+
+	if (is_pmem)
+		elog(LOG, "%s is stored on persistent memory.", path);
+	else
+		ret = false;
+
+	return ret;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return PmemFileOpenPerm(pathname, flags, PG_FILE_MODE_DEFAULT, fsize, addr);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+		void **addr)
+{
+	int mapped_flag = 0;
+	size_t mapped_len = 0, size = 0;
+	void *ret_addr;
+
+	if (addr == NULL)
+		return BasicOpenFile(pathname, flags);
+
+	/* non-zero 'len' not allowed without PMEM_FILE_CREATE */
+	if (flags & O_CREAT)
+	{
+		mapped_flag = PMEM_FILE_CREATE;
+		size = fsize;
+	}
+
+	if (flags & O_EXCL)
+		mapped_flag |= PMEM_FILE_EXCL;
+
+	ret_addr = pmem_map_file(pathname, size, mapped_flag, mode, &mapped_len,
+			NULL);
+
+	if (fsize != mapped_len)
+	{
+		if (ret_addr != NULL)
+			pmem_unmap(ret_addr, mapped_len);
+
+		return -1;
+	}
+
+	if (mapped_flag & PMEM_FILE_CREATE)
+		if (msync(ret_addr, mapped_len, MS_SYNC))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not msync log file %s: %m", pathname)));
+
+	*addr = ret_addr;
+
+	return NO_FD_FOR_MAPPED_FILE;
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	pmem_memcpy_nodrain((void *)dest, src, len);
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	memcpy(buf, (void *)map_addr, len);
+}
+
+void
+PmemFileSync(void)
+{
+	pmem_drain();
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	return pmem_unmap((void *)addr, fsize);
+}
+
+
+#else
+bool
+CheckPmem(const char *path)
+{
+	return true;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return BasicOpenFile(pathname, flags);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+		void **addr)
+{
+	return BasicOpenFilePerm(pathname, flags, mode);
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileSync(void)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+	return -1;
+}
+#endif
+
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 72f6be3..e60310c 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -3887,7 +3887,7 @@ static struct config_enum ConfigureNamesEnum[] =
 		},
 		&sync_method,
 		DEFAULT_SYNC_METHOD, sync_method_options,
-		NULL, assign_xlog_sync_method, NULL
+		check_xlog_sync_method, assign_xlog_sync_method, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 69f40f0..7b70ba0 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -190,6 +190,7 @@
 					#   fsync
 					#   fsync_writethrough
 					#   open_sync
+					#   pmem_drain
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 421ba6d..f1e886f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -27,6 +27,7 @@
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
+#define SYNC_METHOD_PMEM_DRAIN	5		/* for Persistent Memory Development Kit */
 extern int	sync_method;
 
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
@@ -226,8 +227,10 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+		void **addr);
+extern int	XLogFileOpen(XLogSegNo segno, void **addr);
+extern int	do_XLogFileClose(int fd, void *addr);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -239,6 +242,7 @@ extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern int	xlog_fsync(int fd, void *addr);
 
 extern bool RecoveryInProgress(void);
 extern bool HotStandbyActive(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index db5ca16..b48bed9 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -49,6 +49,13 @@
 typedef int File;
 
 
+/*
+ * Default mode for created files, unless something else is specified using
+ * the *Perm() function variants.
+ */
+#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 
@@ -107,6 +114,13 @@ extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern int	CloseTransientFile(int fd);
 
+/* Operations to allow use of a memory-mapped file */
+extern int	MapTransientFile(const char *fileName, int fileFlags, size_t fsize,
+		void **addr);
+extern int	MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+		size_t fsize, void **addr);
+extern int	UnmapTransientFile(void *addr, size_t fsize);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
 extern int	BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -132,7 +146,8 @@ extern void pg_flush_data(int fd, off_t offset, off_t amount);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int	durable_link_or_rename(const char *oldfile, const char *newfile,
+		int loglevel, bool fsync_file);
 extern void SyncDataDirectory(void);
 
 /* Filename components */
diff --git a/src/include/storage/pmem.h b/src/include/storage/pmem.h
new file mode 100644
index 0000000..823889a
--- /dev/null
+++ b/src/include/storage/pmem.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.h
+ *		Virtual file descriptor definitions for persistent memory.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/pmem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PMEM_H
+#define PMEM_H
+
+#include "postgres.h"
+
+#define NO_FD_FOR_MAPPED_FILE -2
+
+extern bool	CheckPmem(const char *path);
+extern int	PmemFileOpen(const char *pathname, int flags, size_t fsize,
+		void **addr);
+extern int	PmemFileOpenPerm(const char *pathname, int flags, int mode,
+		size_t fsize, void **addr);
+extern void	PmemFileWrite(void *dest, void *src, size_t len);
+extern void	PmemFileRead(void *map_addr, void *buf, size_t len);
+extern void	PmemFileSync(void);
+extern int	PmemFileClose(void *addr, size_t fsize);
+
+#endif /* PMEM_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 77daa5a..9319271 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -432,6 +432,7 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xlog_sync_method(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
0003-Walreceiver-WAL-IO-using-PMDK.patch (application/octet-stream)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index a39a98f..887946c 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -60,6 +60,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -90,6 +91,7 @@ static int	recvFile = -1;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
 static uint32 recvOff = 0;
+void	*mappedFileAddr = NULL;
 
 /*
  * Flags set by interrupt handlers of walreceiver for later service in the
@@ -604,12 +606,12 @@ WalReceiverMain(void)
 		 * End of WAL reached on the requested timeline. Close the last
 		 * segment, and await for new orders from the startup process.
 		 */
-		if (recvFile >= 0)
+		if (recvFile >= 0 || mappedFileAddr != NULL)
 		{
 			char		xlogfname[MAXFNAMELEN];
 
 			XLogWalRcvFlush(false);
-			if (close(recvFile) != 0)
+			if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -626,6 +628,7 @@ WalReceiverMain(void)
 				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
+		mappedFileAddr = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -949,7 +952,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if ((recvFile < 0 && mappedFileAddr == NULL) ||
+				!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
 			bool		use_existent;
 
@@ -957,7 +961,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
 			 */
-			if (recvFile >= 0)
+			if (recvFile >= 0 || mappedFileAddr != NULL)
 			{
 				char		xlogfname[MAXFNAMELEN];
 
@@ -968,7 +972,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -985,11 +989,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
+			mappedFileAddr = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true, &mappedFileAddr);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
 		}
@@ -1005,30 +1010,39 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* Need to seek in the file? */
 		if (recvOff != startoff)
 		{
-			if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
-				ereport(PANIC,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(recvFileTLI, recvSegNo),
-								startoff)));
+			if (!mappedFileAddr)
+				if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
+					ereport(PANIC,
+							(errcode_for_file_access(),
+							 errmsg("could not seek in log segment %s to offset %u: %m",
+								 XLogFileNameP(recvFileTLI, recvSegNo),
+								 startoff)));
 			recvOff = startoff;
 		}
 
-		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = write(recvFile, buf, segbytes);
-		if (byteswritten <= 0)
+		if (mappedFileAddr)
 		{
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							XLogFileNameP(recvFileTLI, recvSegNo),
-							recvOff, (unsigned long) segbytes)));
+			PmemFileWrite((char *)mappedFileAddr+startoff, buf, segbytes);
+			byteswritten = segbytes;
+		}
+		else
+		{
+			/* OK to write the logs */
+			errno = 0;
+
+			byteswritten = write(recvFile, buf, segbytes);
+			if (byteswritten <= 0)
+			{
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+							 "at offset %u, length %lu: %m",
+							 XLogFileNameP(recvFileTLI, recvSegNo),
+							 recvOff, (unsigned long) segbytes)));
+			}
 		}
 
 		/* Update state for write */
#2Robert Haas
robertmhaas@gmail.com
In reply to: Yoshimi Ichiyanagi (#1)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:

Using pgbench, which is a general PostgreSQL benchmark, the postgres server
to which the patches are applied is about 5% faster than the original server.
And using my insert benchmark, it is up to 90% faster than the original one.
I will describe these details later.

Interesting. But your insert benchmark looks highly artificial... in
real life, you would not insert the same long static string 160
million times. Or if you did, you would use COPY or INSERT .. SELECT.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#3Robert Haas
robertmhaas@gmail.com
In reply to: Yoshimi Ichiyanagi (#1)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Tue, Jan 16, 2018 at 2:00 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:

C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
C-5-1. pgbench
# numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync: tps = 43,179
wal_sync_method=pmem_drain: tps = 45,254

What scale factor was used for this test?

Was the only non-default configuration setting wal_sync_method? i.e.
synchronous_commit=on? No change to max_wal_size?

This seems like an exceedingly short test -- normally, for write
tests, I recommend the median of 3 30-minute runs. It also seems
likely to be client-bound, because of the fact that jobs = clients/4.
Normally I use jobs = clients or at least jobs = clients/2.
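
As a small illustration of the "median of 3 runs" suggestion, the following C sketch picks the median of three tps figures. It is only an arithmetic aid; median_of_three is a hypothetical helper name and is not part of pgbench.

```c
/*
 * Return the median of three throughput measurements (e.g. tps values
 * from three pgbench runs).  Hypothetical helper for illustration only.
 */
double
median_of_three(double a, double b, double c)
{
    /* b is the median if it lies between a and c */
    if ((a <= b && b <= c) || (c <= b && b <= a))
        return b;
    /* otherwise a is the median if it lies between b and c */
    if ((b <= a && a <= c) || (c <= a && a <= b))
        return a;
    /* otherwise it must be c */
    return c;
}
```

Unlike an average, the median is not pulled up or down by a single outlier run, which is presumably why it is preferred for noisy write tests.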

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Yoshimi Ichiyanagi
ichiyanagi.yoshimi@lab.ntt.co.jp
In reply to: Robert Haas (#2)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Thank you for your reply.

<CA+TgmobUrKBWgOa8x=mbW4Cmsb=NeV8Egf+RSLp7XiCAjHdmgw@mail.gmail.com>
Wed, 17 Jan 2018 15:29:11 -0500, Robert Haas <robertmhaas@gmail.com> wrote:

Using pgbench, which is a general PostgreSQL benchmark, the postgres server
to which the patches are applied is about 5% faster than the original server.
And using my insert benchmark, it is up to 90% faster than the original one.
I will describe these details later.

Interesting. But your insert benchmark looks highly artificial... in
real life, you would not insert the same long static string 160
million times. Or if you did, you would use COPY or INSERT .. SELECT.

I made this benchmark in order to put very heavy WAL I/O load on PMEM.

PMEM is very fast. I ran a fio-like micro-benchmark on PMEM.
The workload consisted of synchronous sequential writes in 8 KB blocks,
and the total write size was 40 GB.

The micro-benchmark result was the following.
Using DAX FS(like fdatasync): 5,559 MB/sec
Using DAX FS and PMDK(like pmem_drain): 13,177 MB/sec

Using pgbench, the postgres server to which my patches were applied was
only 5% faster than the original server.

The averages of running pgbench three times are:
wal_sync_method=fdatasync: tps = 43,179
wal_sync_method=pmem_drain: tps = 45,254

While this pgbench was running, the utilization of the 8 CPU cores (on which
the postgres server was running) was about 800%, and the throughput of
WAL I/O was about 10 MB/sec. I concluded that pgbench does not put a heavy
enough WAL I/O load on PMEM, so I wrote and ran a WAL I/O intensive test.

Do you know any good WAL I/O intensive benchmarks? DBT2?

<CA+TgmoawGN6Z8PcLKrMrGg99hF0028sFS2a1_VQEMDKcJjQDMQ@mail.gmail.com>
Wed, 17 Jan 2018 15:40:25 -0500, Robert Haas <robertmhaas@gmail.com> wrote:

C-5. Running the 2 benchmarks(1. pgbench, 2. my insert benchmark)
C-5-1. pgbench
# numactl -N 1 pgbench -c 32 -j 8 -T 120 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync: tps = 43,179
wal_sync_method=pmem_drain: tps = 45,254

What scale factor was used for this test?

The scale factor was 200.

# numactl -N 0 pgbench -s 200 -i [DB_NAME]

Was the only non-default configuration setting wal_sync_method? i.e.
synchronous_commit=on? No change to max_wal_size?

No, I used the following parameters in postgresql.conf to prevent
checkpoints from occurring while running the tests.

# - Settings -
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = pmem_drain
full_page_writes = on
wal_compression = off

# - Checkpoints -
checkpoint_timeout = 1d
max_wal_size = 20GB
min_wal_size = 20GB

This seems like an exceedingly short test -- normally, for write
tests, I recommend the median of 3 30-minute runs. It also seems
likely to be client-bound, because of the fact that jobs = clients/4.
Normally I use jobs = clients or at least jobs = clients/2.

Thank you for your kind proposal. I did that.

# numactl -N 0 pgbench -s 200 -i [DB_NAME]
# numactl -N 1 pgbench -c 32 -j 32 -T 1800 -M prepared [DB_NAME]

The averages of running pgbench three times are:
wal_sync_method=fdatasync: tps = 39,966
wal_sync_method=pmem_drain: tps = 41,365

--
Yoshimi Ichiyanagi

#5Robert Haas
robertmhaas@gmail.com
In reply to: Yoshimi Ichiyanagi (#4)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Fri, Jan 19, 2018 at 4:56 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:

Was the only non-default configuration setting wal_sync_method? i.e.
synchronous_commit=on? No change to max_wal_size?

No, I used the following parameter in postgresql.conf to prevent
checkpoints from occurring while running the tests.

I think that you really need to include the checkpoints in the tests.
I would suggest setting max_wal_size and/or checkpoint_timeout so that
you reliably complete 2 checkpoints in a 30-minute test, and then do a
comparison on that basis.

Do you know any good WAL I/O intensive benchmarks? DBT2?

pgbench is quite a WAL-intensive benchmark; it is much more
write-heavy than what most systems experience in real life, at least
in my experience. Your comparison of DAX FS to DAX FS + PMDK is very
interesting, but in real life the bandwidth of DAX FS is already so
high -- and the latency so low -- that I think most real-world
workloads won't gain very much. At least, that is my impression based
on internal testing EnterpriseDB did a few months back. (Thanks to
Mithun and Kuntal for that work.)

That's not necessarily an argument against this patch, which by the
way I have not reviewed. Even a 5% speedup on this kind of workload
is potentially worthwhile; everyone likes it when things go faster.
I'm just not convinced you can get very much more than that on a
realistic workload. Of course, I might be wrong.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#6Robert Haas
robertmhaas@gmail.com
In reply to: Robert Haas (#5)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Fri, Jan 19, 2018 at 9:42 AM, Robert Haas <robertmhaas@gmail.com> wrote:

That's not necessarily an argument against this patch, which by the
way I have not reviewed. Even a 5% speedup on this kind of workload
is potentially worthwhile; everyone likes it when things go faster.
I'm just not convinced you can get very much more than that on a
realistic workload. Of course, I might be wrong.

Oh, incidentally -- in our internal testing, we found that
wal_sync_method=open_datasync was significantly faster than
wal_sync_method=fdatasync. You might find that open_datasync isn't
much different from pmem_drain, even though they're both faster than
fdatasync.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#6)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From: Robert Haas [mailto:robertmhaas@gmail.com]

Oh, incidentally -- in our internal testing, we found that
wal_sync_method=open_datasync was significantly faster than
wal_sync_method=fdatasync. You might find that open_datasync isn't much
different from pmem_drain, even though they're both faster than fdatasync.

That's interesting. How fast was open_datasync in what environment (Linux distro/kernel version, HDD or SSD etc.)?

Is it now time to change the default setting to open_datasync on Linux, at least when O_DIRECT is not used (i.e. WAL archiving or streaming replication is used)?

[Current port/linux.h]
/*
* Set the default wal_sync_method to fdatasync. With recent Linux versions,
* xlogdefs.h's normal rules will prefer open_datasync, which (a) doesn't
* perform better and (b) causes outright failures on ext4 data=journal
* filesystems, because those don't support O_DIRECT.
*/
#define PLATFORM_DEFAULT_SYNC_METHOD SYNC_METHOD_FDATASYNC

pg_test_fsync showed open_datasync is slower on my RHEL6 VM:

----------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              4276.373 ops/sec     234 usecs/op
        fdatasync                  4895.256 ops/sec     204 usecs/op
        fsync                      4797.094 ops/sec     208 usecs/op
        fsync_writethrough              n/a
        open_sync                  4575.661 ops/sec     219 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              2243.680 ops/sec     446 usecs/op
        fdatasync                  4347.466 ops/sec     230 usecs/op
        fsync                      4337.312 ops/sec     231 usecs/op
        fsync_writethrough              n/a
        open_sync                  2329.700 ops/sec     429 usecs/op
----------------------------------------

Regards
Takayuki Tsunakawa

#8Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#7)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Tue, Jan 23, 2018 at 8:07 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

From: Robert Haas [mailto:robertmhaas@gmail.com]

Oh, incidentally -- in our internal testing, we found that
wal_sync_method=open_datasync was significantly faster than
wal_sync_method=fdatasync. You might find that open_datasync isn't much
different from pmem_drain, even though they're both faster than fdatasync.

That's interesting. How fast was open_datasync in what environment (Linux distro/kernel version, HDD or SSD etc.)?

Is it now time to change the default setting to open_datasync on Linux, at least when O_DIRECT is not used (i.e. WAL archiving or streaming replication is used)?

I think open_datasync will be worse on systems where fsync() is
expensive -- it forces the data out to disk immediately, even if the
data doesn't need to be flushed immediately. That's bad, because we
wait immediately when we could have deferred the wait until later and
maybe gotten the WAL writer to do the work in the background. But it
might be better on systems where fsync() is basically free, because
there you might as well just get it out of the way immediately and not
leave something left to be done later.
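
To make the distinction concrete, here is a minimal POSIX sketch of the two strategies, assuming Linux (where O_DSYNC is available). It is not PostgreSQL code and the function names are hypothetical. The first variant makes every write() wait for stable storage, which is roughly what open_datasync does (leaving O_DIRECT aside); the second separates the write from the flush, so the caller chooses when to pay for the sync.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * open_datasync style: the file is opened with O_DSYNC, so write() itself
 * does not return until the data is on stable storage.
 * Returns 0 on success, -1 on failure.
 */
int
write_open_datasync(const char *path, const void *buf, size_t len)
{
    int     fd = open(path, O_WRONLY | O_CREAT | O_DSYNC, 0600);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = write(fd, buf, len);
    if (close(fd) != 0 || n != (ssize_t) len)
        return -1;
    return 0;
}

/*
 * fdatasync style: write() only hands the data to the OS; the flush is a
 * separate call that could be issued later (or by another process).
 * Returns 0 on success, -1 on failure.
 */
int
write_then_fdatasync(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0600);
    int rc = -1;

    if (fd < 0)
        return -1;
    if (write(fd, buf, len) == (ssize_t) len && fdatasync(fd) == 0)
        rc = 0;
    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```

In the second variant several write() calls can be issued before a single fdatasync(), which is the deferral described above; the first variant has no such option.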

This is just a guess, of course. You didn't mention what the
underlying storage for your test was?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#8)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From: Robert Haas [mailto:robertmhaas@gmail.com]

I think open_datasync will be worse on systems where fsync() is expensive
-- it forces the data out to disk immediately, even if the data doesn't
need to be flushed immediately. That's bad, because we wait immediately
when we could have deferred the wait until later and maybe gotten the WAL
writer to do the work in the background. But it might be better on systems
where fsync() is basically free, because there you might as well just get
it out of the way immediately and not leave something left to be done later.

This is just a guess, of course. You didn't mention what the underlying
storage for your test was?

Uh, your guess was correct. My file system was ext3, where fsync() writes all dirty buffers in page cache.

As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on an LVM volume with ext4 (mounted with options noatime, nobarrier) on PCIe flash memory.

5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             50829.597 ops/sec      20 usecs/op
        fdatasync                 42094.381 ops/sec      24 usecs/op
        fsync                     42209.972 ops/sec      24 usecs/op
        fsync_writethrough              n/a
        open_sync                 48669.605 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             26366.373 ops/sec      38 usecs/op
        fdatasync                 33922.725 ops/sec      29 usecs/op
        fsync                     32990.209 ops/sec      30 usecs/op
        fsync_writethrough              n/a
        open_sync                 24326.249 ops/sec      41 usecs/op

What do you think about changing the default value of wal_sync_method on Linux in PG 11? I can understand the concern that users might hit performance degradation if they are using PostgreSQL on older systems. But it's also mottainai that many users don't notice the benefits of wal_sync_method = open_datasync on new systems.

Regards
Takayuki Tsunakawa

#10Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#9)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

This is just a guess, of course. You didn't mention what the underlying
storage for your test was?

Uh, your guess was correct. My file system was ext3, where fsync() writes all dirty buffers in page cache.

Oh, ext3 is terrible. I don't think you can do any meaningful
benchmark results on ext3. Use ext4 or, if you prefer, xfs.

As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on an LVM volume with ext4 (mounted with options noatime, nobarrier) on PCIe flash memory.

So does that mean it was faster than your PMDK implementation?

What do you think about changing the default value of wal_sync_method on Linux in PG 11? I can understand the concern that users might hit performance degradation if they are using PostgreSQL on older systems. But it's also mottainai that many users don't notice the benefits of wal_sync_method = open_datasync on new systems.

Well, some day persistent memory may be a common enough storage
technology that such a change makes sense, but these days most people
have either SSD or spinning disks, where the change would probably be
a net negative. It seems more like something we might think about
changing in PG 20 or PG 30.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#10)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From: Robert Haas [mailto:robertmhaas@gmail.com]

On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

As you said, open_datasync was 20% faster than fdatasync on RHEL7.2, on
an LVM volume with ext4 (mounted with options noatime, nobarrier) on PCIe
flash memory.

So does that mean it was faster than your PMDK implementation?

The PMDK patch is not mine, but is from people in NTT Lab. I'm very curious about the comparison of open_datasync and PMDK, too.

What do you think about changing the default value of wal_sync_method
on Linux in PG 11? I can understand the concern that users might hit
performance degradation if they are using PostgreSQL on older systems. But
it's also mottainai that many users don't notice the benefits of
wal_sync_method = open_datasync on new systems.

Well, some day persistent memory may be a common enough storage technology
that such a change makes sense, but these days most people have either SSD
or spinning disks, where the change would probably be a net negative. It
seems more like something we might think about changing in PG 20 or PG 30.

No, I'm not saying we should make the persistent memory mode the default. I'm simply asking whether it's time to make open_datasync the default setting. We can write a notice in the release note for users who still use ext3 etc. on old systems. If there's no objection, I'll submit a patch for the next CF.

Regards
Takayuki Tsunakawa

#12Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#11)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

No, I'm not saying we should make the persistent memory mode the default. I'm simply asking whether it's time to make open_datasync the default setting. We can write a notice in the release note for users who still use ext3 etc. on old systems. If there's no objection, I'll submit a patch for the next CF.

Well, like I said, I think that will degrade performance for users of
SSDs or spinning disks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#13Michael Paquier
michael.paquier@gmail.com
In reply to: Robert Haas (#10)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Thu, Jan 25, 2018 at 09:30:45AM -0500, Robert Haas wrote:

On Wed, Jan 24, 2018 at 10:31 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

This is just a guess, of course. You didn't mention what the underlying
storage for your test was?

Uh, your guess was correct. My file system was ext3, where fsync() writes all dirty buffers in page cache.

Oh, ext3 is terrible. I don't think you can do any meaningful
benchmark results on ext3. Use ext4 or, if you prefer, xfs.

Or to put it short, the lack of granular syncs in ext3 kills
performance for some workloads. Tomas Vondra's presentation on such
matters is a really cool read by the way:
https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs
(I would have loved to see this presentation live).
--
Michael

#14Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#12)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From: Robert Haas [mailto:robertmhaas@gmail.com]

On Thu, Jan 25, 2018 at 7:08 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

No, I'm not saying we should make the persistent memory mode the default.
I'm simply asking whether it's time to make open_datasync the default
setting. We can write a notice in the release note for users who still
use ext3 etc. on old systems. If there's no objection, I'll submit a patch
for the next CF.

Well, like I said, I think that will degrade performance for users of SSDs
or spinning disks.

As I showed previously, regular file writes on PCIe flash, *not writes using PMDK on persistent memory*, were 20% faster with open_datasync than with fdatasync.

In addition, regular file writes on HDD with ext4 were also 10% faster:

--------------------------------------------------
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              3408.905 ops/sec     293 usecs/op
        fdatasync                  3111.621 ops/sec     321 usecs/op
        fsync                      3609.940 ops/sec     277 usecs/op
        fsync_writethrough              n/a
        open_sync                  3356.362 ops/sec     298 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync              1892.157 ops/sec     528 usecs/op
        fdatasync                  3284.278 ops/sec     304 usecs/op
        fsync                      3066.655 ops/sec     326 usecs/op
        fsync_writethrough              n/a
        open_sync                  1853.415 ops/sec     540 usecs/op
--------------------------------------------------

And you said open_datasync was significantly faster than fdatasync. Could you show your results? What device and filesystem did you use?

Regards
Takayuki Tsunakawa

#15Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Michael Paquier (#13)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From: Michael Paquier [mailto:michael.paquier@gmail.com]

Or to put it short, the lack of granular syncs in ext3 kills performance
for some workloads. Tomas Vondra's presentation on such matters is a really
cool read by the way:
https://www.slideshare.net/fuzzycz/postgresql-on-ext4-xfs-btrfs-and-zfs

Yeah, I saw this recently, too. That was cool.

Regards
Takayuki Tsunakawa

#16Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#14)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Thu, Jan 25, 2018 at 8:32 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

As I showed previously, regular file writes on PCIe flash, *not writes using PMDK on persistent memory*, was 20% faster with open_datasync than with fdatasync.

If I understand correctly, those results are all just pg_test_fsync
results. That's not reflective of what will happen when the database
is actually running. When you use open_sync or open_datasync, you
force WAL write and WAL flush to happen simultaneously, instead of
letting the WAL flush be delayed.

And you said open_datasync was significantly faster than fdatasync. Could you show your results? What device and filesystem did you use?

I don't have the results handy at the moment. We found it to be
faster on a database benchmark where the WAL was stored on an NVRAM
device.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#17Tsunakawa, Takayuki
tsunakawa.takay@jp.fujitsu.com
In reply to: Robert Haas (#16)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

From: Robert Haas [mailto:robertmhaas@gmail.com]

If I understand correctly, those results are all just pg_test_fsync results.
That's not reflective of what will happen when the database is actually
running. When you use open_sync or open_datasync, you force WAL write and
WAL flush to happen simultaneously, instead of letting the WAL flush be
delayed.

Yes, that's pg_test_fsync output. Isn't pg_test_fsync the tool to determine the value for wal_sync_method? Is this manual misleading?

https://www.postgresql.org/docs/devel/static/pgtestfsync.html
--------------------------------------------------
pg_test_fsync - determine fastest wal_sync_method for PostgreSQL

pg_test_fsync is intended to give you a reasonable idea of what the fastest wal_sync_method is on your specific system, as well as supplying diagnostic information in the event of an identified I/O problem.
--------------------------------------------------

Anyway, I'll use pgbench, and submit a patch if open_datasync is better than fdatasync. I guess the current tweak of making fdatasync the default is a holdover from the era before ext4 and XFS became prevalent.

I don't have the results handy at the moment. We found it to be faster
on a database benchmark where the WAL was stored on an NVRAM device.

Oh, NVRAM. Interesting. Then I'll try an open_datasync/fdatasync comparison on HDD and SSD/PCIe flash with pgbench.

Regards
Takayuki Tsunakawa

#18Robert Haas
robertmhaas@gmail.com
In reply to: Tsunakawa, Takayuki (#17)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Thu, Jan 25, 2018 at 8:54 PM, Tsunakawa, Takayuki
<tsunakawa.takay@jp.fujitsu.com> wrote:

Yes, that's pg_test_fsync output. Isn't pg_test_fsync the tool to determine the value for wal_sync_method? Is this manual misleading?

Hmm. I hadn't thought about it as misleading, but now that you
mention it, I'd say that it probably is. I suspect that there should
be a disclaimer saying that the fastest WAL sync method in terms of
ops/second is not necessarily the one that will deliver the best
database performance, and mention the issues around open_sync and
open_datasync specifically. But let's see what your testing shows;
I'm talking based on now-fairly-old experience with this and a passing
familiarity with the relevant source code.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#19Yoshimi Ichiyanagi
ichiyanagi.yoshimi@lab.ntt.co.jp
In reply to: Robert Haas (#10)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

<CA+TgmoZygQO3EC4mMdf-b=UuY3HZz6+-Y2w5_s9bLtH4NPw6Bg@mail.gmail.com>
Fri, 19 Jan 2018 09:42:25 -0500, Robert Haas <robertmhaas@gmail.com> wrote:

I think that you really need to include the checkpoints in the tests.
I would suggest setting max_wal_size and/or checkpoint_timeout so that
you reliably complete 2 checkpoints in a 30-minute test, and then do a
comparison on that basis.

Experimental setup:
-------------------------
Server: HP ProLiant DL360 Gen9
CPU: Xeon E5-2667 v4 (3.20GHz); 2 processors(without HT)
DRAM: DDR4-2400; 32 GiB/processor
(8GiB/socket x 4 sockets/processor) x 2 processors
NVDIMM: DDR4-2133; 32 GiB/processor
(node 0: 8GiB/socket x 2 sockets/processor,
node 1: 8GiB/socket x 6 sockets/processor)
HDD: Seagate Constellation2 2.5inch SATA 3.0. 6Gb/s 1TB 7200rpm x 1
SATA-SSD: Crucial_CT500MX200SSD1 (SATA 3.2, SATA 6Gb/s)
OS: Ubuntu 16.04, linux-4.12
DAX FS: ext4
PMDK: master (as of Aug 30, 2017)
PostgreSQL: master
Note: I bound the postgres processes to one NUMA node,
and the benchmarks to the other NUMA node.
-------------------------

postgresql.conf
-------------------------
# - Settings -
wal_level = replica
fsync = on
synchronous_commit = on
wal_sync_method = pmem_drain/fdatasync/open_datasync
full_page_writes = on
wal_compression = off

# - Checkpoints -
checkpoint_timeout = 12min
max_wal_size = 20GB
min_wal_size = 20GB
-------------------------

Executed commands:
--------------------------------------------------------------------
# numactl -N 1 pg_ctl start -D [PG_DIR] -l [LOG_FILE]
# numactl -N 0 pgbench -s 200 -i [DB_NAME]
# numactl -N 0 pgbench -c 32 -j 32 -T 1800 -r [DB_NAME] -M prepared
--------------------------------------------------------------------

The results:
--------------------------------------------------------------------
A) Applied the patches to PG src, and compiled PG with libpmem
B) Applied the patches to PG src, and compiled PG without libpmem
C) Original PG

The averages of running pgbench three times on *PMEM* are:
A)
wal_sync_method = pmem_drain tps = 41660.42524
wal_sync_method = open_datasync tps = 39913.49897
wal_sync_method = fdatasync tps = 39900.83396

C)
wal_sync_method = open_datasync tps = 40335.50178
wal_sync_method = fdatasync tps = 40649.57772

The averages of running pgbench three times on *SATA-SSD* are:
B)
wal_sync_method = open_datasync tps = 7224.07146
wal_sync_method = fdatasync tps = 7222.19177

C)
wal_sync_method = open_datasync tps = 7258.79093
wal_sync_method = fdatasync tps = 7263.19878
--------------------------------------------------------------------

The above results show that, on PMEM, wal_sync_method=pmem_drain was
about 4% faster than wal_sync_method=open_datasync/fdatasync in build A
(41,660 tps versus roughly 39,900 tps).
When pgbench ran on SATA-SSD, wal_sync_method=fdatasync was as fast
as wal_sync_method=open_datasync.
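
The relative difference between two tps averages can be checked with a few lines of C. This is only an arithmetic sketch; speedup_percent is a hypothetical helper name and is not part of pgbench or the patches.

```c
/*
 * Relative speedup of candidate_tps over baseline_tps, in percent.
 * A positive result means the candidate is faster than the baseline.
 * Hypothetical helper for illustration only.
 */
double
speedup_percent(double baseline_tps, double candidate_tps)
{
    return (candidate_tps - baseline_tps) / baseline_tps * 100.0;
}
```

With the build A averages above, speedup_percent(39913.49897, 41660.42524) comes to about 4.4, and against the original server's fdatasync average on PMEM (40649.57772) it comes to about 2.5.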

Do you know any good WAL I/O intensive benchmarks? DBT2?

pgbench is quite a WAL-intensive benchmark; it is much more
write-heavy than what most systems experience in real life, at least
in my experience. Your comparison of DAX FS to DAX FS + PMDK is very
interesting, but in real life the bandwidth of DAX FS is already so
high -- and the latency so low -- that I think most real-world
workloads won't gain very much. At least, that is my impression based
on internal testing EnterpriseDB did a few months back. (Thanks to
Mithun and Kuntal for that work.)

In the near future, many physical devices will send sensing data
(IoT devices might be able to saturate tens of gigabits of network bandwidth).
The amount of data inserted into the DB will increase significantly.
I think that PMEM will be needed for use cases like IoT.

<CA+TgmobDO4qj2nMLdm2Dv5VRT8cVQjv7kftsS_P-kNpNw=TRug@mail.gmail.com>
Thu, 25 Jan 2018 09:30:45 -0500, Robert Haas <robertmhaas@gmail.com> wrote:

Well, some day persistent memory may be a common enough storage
technology that such a change makes sense, but these days most people
have either SSD or spinning disks, where the change would probably be
a net negative. It seems more like something we might think about
changing in PG 20 or PG 30.

Oracle and Microsoft SQL Server supported PMEM [1][2].
I think it is not too early for PostgreSQL to support PMEM.

[1]: http://dbheartbeat.blogspot.jp/2017/11/doag-2017-oracle-18c-dbim-oracle.htm
[2]: https://www.snia.org/sites/default/files/PM-Summit/2018/presentations/06_PM_Summit_2018_Talpey-Final_Post-CORRECTED.pdf

--
Yoshimi Ichiyanagi

#20Robert Haas
robertmhaas@gmail.com
In reply to: Yoshimi Ichiyanagi (#19)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:

Oracle and Microsoft SQL Server supported PMEM [1][2].
I think it is not too early for PostgreSQL to support PMEM.

I agree; it's good to have the option available for those who have
access to the hardware.

If you haven't added your patch to the next CommitFest, please do so.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#21Yoshimi Ichiyanagi
ichiyanagi.yoshimi@lab.ntt.co.jp
In reply to: Robert Haas (#20)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Tue, Jan 30, 2018 at 3:37 AM, Yoshimi Ichiyanagi
<ichiyanagi.yoshimi@lab.ntt.co.jp> wrote:

Oracle and Microsoft SQL Server supported PMEM [1][2].
I think it is not too early for PostgreSQL to support PMEM.

I agree; it's good to have the option available for those who have
access to the hardware.

If you haven't added your patch to the next CommitFest, please do so.

Thank you for your time.

I added my patches to the CommitFest 2018-3.
https://commitfest.postgresql.org/17/1485/

Oh by the way, we submitted this proposal(Introducing PMDK into
PostgreSQL) to PGcon2018.
If our proposal is accepted and you have time, please listen to
our presentation.

--
Yoshimi Ichiyanagi
Mailto : ichiyanagi.yoshimi@lab.ntt.co.jp

#22Andres Freund
andres@anarazel.de
In reply to: Yoshimi Ichiyanagi (#21)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:

I added my patches to the CommitFest 2018-3.
https://commitfest.postgresql.org/17/1485/

Unfortunately this is the last CF for the v11 development cycle. This is
a major project submitted late for v11, there's been no code level
review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
to move this to the next CF?

Greetings,

Andres Freund

#23Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Yoshimi Ichiyanagi (#1)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:

Hi.

These patches enable to use Persistent Memory Development Kit(PMDK)[1]
for reading/writing WAL logs on persistent memory(PMEM).
PMEM is next generation storage and it has a number of nice features:
fast, byte-addressable and non-volatile.

Interesting. How does this compare with using good old mmap()? I think
just doing that would allow eliminating much of the complexity around
managing the shared_buffers. And if the OS is smart about persistent
memory (I don't know what the state of the art on that is), presumably
msync() and fsync() on a file that lives in persistent memory is
lightning fast.

- Heikki

#24Yoshimi Ichiyanagi
ichiyanagi.yoshimi@lab.ntt.co.jp
In reply to: Andres Freund (#22)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Thu, 1 Mar 2018 02:36:41 -0800, Andres Freund <andres@anarazel.de> wrote:

On 2018-02-05 09:59:25 +0900, Yoshimi Ichiyanagi wrote:

I added my patches to the CommitFest 2018-3.
https://commitfest.postgresql.org/17/1485/

Unfortunately this is the last CF for the v11 development cycle. This is
a major project submitted late for v11, there's been no code level
review, the goals aren't agreed upon yet, etc. So I'd unfortunately like
to move this to the next CF?

Understood. I have changed the status to "move to next CF".

--
Yoshimi Ichiyanagi
NTT laboratories

#25Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Heikki Linnakangas (#23)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On 01/03/18 12:40, Heikki Linnakangas wrote:

On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:

These patches enable to use Persistent Memory Development Kit(PMDK)[1]
for reading/writing WAL logs on persistent memory(PMEM).
PMEM is next generation storage and it has a number of nice features:
fast, byte-addressable and non-volatile.

Interesting. How does this compare with using good old mmap()? I think
just doing that would allow eliminating much of the complexity around
managing the shared_buffers. And if the OS is smart about persistent
memory (I don't know what the state of the art on that is), presumably
msync() and fsync() on a file that lives in persistent memory is
lightning fast.

I briefly looked at the docs at pmem.io. pmem_map_file() uses mmap()
under the hood, but it does some extra checks to test if the file is on
a persistent memory device, and makes a note of it.

I think the way forward with this patch would be to map WAL segments
with plain old mmap(), and use msync(). If that's faster than the status
quo, great. If not, it would still be a good stepping stone for actually
using PMDK. If nothing else, it would provide a way to test most of the
code paths, without actually having a persistent memory device, or
libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest
doing exactly that: use libpmem to map a file to memory, and check if it
lives on persistent memory using libpmem's pmem_is_pmem() function. If
it returns true, use pmem_drain(); if it returns false, fall back to using
msync().
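
As a rough illustration of that stepping stone, here is a minimal C sketch
that maps a segment-sized file with plain mmap() and makes a write durable
with msync(). The helper names wal_map_segment() and wal_write_and_flush()
are made up for this example and are not part of the patch; with libpmem,
the msync() call is the place where a pmem_is_pmem() check would pick the
PMEM-specific flush instead.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): create or open a segment file,
 * size it to seg_size bytes and map it read/write.
 * Returns the mapping, or NULL on failure.
 */
void *
wal_map_segment(const char *path, size_t seg_size)
{
	int		fd = open(path, O_RDWR | O_CREAT, 0600);
	void   *addr;

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t) seg_size) != 0)
	{
		close(fd);
		return NULL;
	}
	addr = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close() */
	return (addr == MAP_FAILED) ? NULL : addr;
}

/*
 * Copy len bytes to offset off in the mapping, then flush them with msync().
 * msync() wants a page-aligned start address, so the range is rounded down
 * to the page that contains off.
 */
int
wal_write_and_flush(void *base, size_t off, const void *buf, size_t len)
{
	size_t	pagesz = (size_t) sysconf(_SC_PAGESIZE);
	size_t	start = off - (off % pagesz);

	assert(base != NULL && buf != NULL);
	memcpy((char *) base + off, buf, len);
	return msync((char *) base + start, (off - start) + len, MS_SYNC);
}
```

A caller would map the segment once, call wal_write_and_flush() for each
write, and munmap() the segment when switching to the next one.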

- Heikki

#26Yoshimi Ichiyanagi
ichiyanagi.yoshimi@lab.ntt.co.jp
In reply to: Heikki Linnakangas (#25)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

I'm sorry for the delay in replying to your mail.

On Thu, 1 Mar 2018 18:40:05 +0800, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Interesting. How does this compare with using good old mmap()?

libpmem's pmem_map_file() supports 2 MB/1 GB (huge page size)
alignment, which can reduce the number of page faults.
In addition, libpmem's pmem_memcpy_nodrain() copies data using
single instruction, multiple data (SIMD) instructions
and non-temporal (NT) store instructions (MOVNT).
As a result, using these APIs is faster than using plain mmap()/memcpy().
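
To make the NT-store point concrete, here is a small C sketch of a copy
loop built on SSE2 streaming stores. It is only an illustration of the
technique, not libpmem's implementation: copy_nontemporal() is a
hypothetical name, it falls back to memcpy() on CPUs without SSE2, and
unlike a "nodrain" variant it issues the store fence itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/*
 * Hypothetical helper, not libpmem code: copy len bytes, using non-temporal
 * (cache-bypassing) 16-byte stores for the aligned middle part.
 * The unaligned head and tail go through memcpy() and therefore through the
 * CPU cache; a real PMEM copy routine would also have to flush those bytes.
 */
void
copy_nontemporal(void *dst, const void *src, size_t len)
{
	assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
	char	   *d = dst;
	const char *s = src;
	/* bytes needed to bring the destination up to 16-byte alignment */
	size_t		head = (16 - ((uintptr_t) d & 15)) & 15;

	if (head > len)
		head = len;
	memcpy(d, s, head);
	d += head;
	s += head;
	len -= head;

	while (len >= 16)
	{
		__m128i		v = _mm_loadu_si128((const __m128i *) s);

		_mm_stream_si128((__m128i *) d, v);	/* MOVNTDQ */
		d += 16;
		s += 16;
		len -= 16;
	}
	memcpy(d, s, len);
	_mm_sfence();				/* order the streaming stores */
#else
	memcpy(dst, src, len);
#endif
}
```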

Please see the PGCon2018 presentation[1] for the details.

[1]: https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

On Fri, 20 Jul 2018 23:18:05 +0300, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

I think the way forward with this patch would be to map WAL segments
with plain old mmap(), and use msync(). If that's faster than the status
quo, great. If not, it would still be a good stepping stone for actually
using PMDK.

I think so too.

I wrote this patch to replace read/write syscalls with libpmem's
API only. I believe that PMDK can make the current PostgreSQL faster.

If nothing else, it would provide a way to test most of the
code paths, without actually having a persistent memory device, or
libpmem. The examples at http://pmem.io/pmdk/libpmem/ actually suggest
doing exactly that: use libpmem to map a file to memory, and check if it
lives on persistent memory using libpmem's pmem_is_pmem() function. If
it returns true, use pmem_drain(); if it returns false, fall back to using
msync().

When PMEM_IS_PMEM_FORCE (an environment variable[2]) is set to 1,
pmem_is_pmem() always returns true.

Linux 4.15 and later support the MAP_SYNC and MAP_SHARED_VALIDATE
mmap() flags, which can be used to check whether the mapped file is
stored on PMEM.
An application that uses both flags in its mmap() call can be sure
that MAP_SYNC is actually supported by both the kernel and
the filesystem that the mapped file is stored in[3].
But pmem_is_pmem() doesn't support this mechanism for now.
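
For reference, the MAP_SYNC check described above can be sketched in C
roughly as follows. file_supports_map_sync() is a hypothetical helper
name, and the fallback flag values are the ones from the Linux UAPI
headers, supplied only for C libraries that do not define them yet.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Fallbacks for older libc headers (values from the Linux UAPI). */
#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x080000
#endif

/*
 * Hypothetical probe: returns 1 if a MAP_SYNC mapping of the file succeeds,
 * 0 if the kernel or the filesystem rejects it, -1 if the file cannot be
 * opened. MAP_SHARED_VALIDATE makes the kernel fail the call on unknown or
 * unsupported flags instead of silently ignoring them.
 */
int
file_supports_map_sync(const char *path, size_t len)
{
	int		fd;
	void   *addr;

	assert(path != NULL && len > 0);
	fd = open(path, O_RDWR);
	if (fd < 0)
		return -1;
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	close(fd);
	if (addr == MAP_FAILED)
		return 0;				/* typically EOPNOTSUPP or EINVAL */
	munmap(addr, len);
	return 1;
}
```

On a file that is not on a DAX-mounted filesystem, the mmap() call is
expected to fail and the probe returns 0.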

[2]: http://pmem.io/pmdk/manpages/linux/v1.4/libpmem/libpmem.7.html
[3]: https://lwn.net/Articles/758594/

--
Yoshimi Ichiyanagi
NTT Software Innovation Center
e-mail : ichiyanagi.yoshimi@lab.ntt.co.jp

#27Michael Paquier
michael@paquier.xyz
In reply to: Yoshimi Ichiyanagi (#26)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:

The libpmem's pmem_map_file() supported 2M/1G(the size of huge page)
alignment, since it could reduce the number of page faults.
In addition, libpmem's pmem_memcpy_nodrain() is the function
to copy data using single instruction, multiple data(SIMD) instructions
and NT store instructions(MOVNT).
As a result, using these APIs is faster than using old mmap()/memcpy().

Please see the PGCon2018 presentation[1] for the details.

[1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into it. Could you provide fresher
performance numbers? I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.
--
Michael

#28Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Michael Paquier (#27)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:

The libpmem's pmem_map_file() supported 2M/1G(the size of huge page)
alignment, since it could reduce the number of page faults.
In addition, libpmem's pmem_memcpy_nodrain() is the function
to copy data using single instruction, multiple data(SIMD) instructions
and NT store instructions(MOVNT).
As a result, using these APIs is faster than using old mmap()/memcpy().

Please see the PGCon2018 presentation[1] for the details.

[1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into it. Could you provide fresher
performance numbers? I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.

Unfortunately, the patch has some conflicts now, so probably not only fresher
performance numbers are necessary, but also a rebased version.

#29Dmitry Dolgov
9erthalion6@gmail.com
In reply to: Dmitry Dolgov (#28)
3 attachment(s)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:

The libpmem's pmem_map_file() supported 2M/1G(the size of huge page)
alignment, since it could reduce the number of page faults.
In addition, libpmem's pmem_memcpy_nodrain() is the function
to copy data using single instruction, multiple data(SIMD) instructions
and NT store instructions(MOVNT).
As a result, using these APIs is faster than using old mmap()/memcpy().

Please see the PGCon2018 presentation[1] for the details.

[1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into it. Could you provide fresher
performance numbers? I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.

Unfortunately, the patch has some conflicts now, so probably not only fresher
performance numbers are necessary, but also a rebased version.

I believe the idea behind this patch is quite important (thanks to CMU DG for
inspiring lectures), so I decided to put some effort into rebasing it to prevent
it from rotting. At the same time I have a vague impression that the patch itself
suggests quite a narrow way of using PMDK.

On 01/03/18 12:40, Heikki Linnakangas wrote:

On 16/01/18 15:00, Yoshimi Ichiyanagi wrote:

These patches enable to use Persistent Memory Development Kit(PMDK)[1]
for reading/writing WAL logs on persistent memory(PMEM).
PMEM is next generation storage and it has a number of nice features:
fast, byte-addressable and non-volatile.

Interesting. How does this compare with using good old mmap()?

E.g. byte-addressability is not used here at all, and it's probably one of the
coolest properties: we could write not a whole block/page, but a small amount
of data, and flush just that using PMDK.
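
To put a number on that, here is a small C sketch comparing how many
bytes have to be flushed for one small record when the flush granularity
is a cache line versus a WAL block. The 64-byte cache line is an
assumption (typical for x86-64), 8192 is PostgreSQL's default
XLOG_BLCKSZ, and flush_span() is a name made up for this example.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64		/* assumption: typical x86-64 cache line */
#define WAL_BLOCK_SIZE	8192	/* PostgreSQL's default XLOG_BLCKSZ */

/*
 * Number of bytes covered when the range [off, off + len) is rounded out
 * to whole multiples of unit on both sides.
 */
size_t
flush_span(size_t off, size_t len, size_t unit)
{
	size_t	start = off - (off % unit);	/* round start down */
	size_t	end = off + len;

	assert(unit > 0 && len > 0);
	end += (unit - (end % unit)) % unit;	/* round end up */
	return end - start;
}
```

For a 100-byte record at offset 8200, flush_span(8200, 100,
CACHE_LINE_SIZE) gives 128 bytes, while flush_span(8200, 100,
WAL_BLOCK_SIZE) gives 8192 bytes.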

Attachments:

0001-Add-configure-option-for-PMDK-v2.patch (application/octet-stream)
From f62ce0a15d56dbf05e8e861f7c17d4c99fd8c97e Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Mon, 10 Dec 2018 17:57:57 +0100
Subject: [PATCH 1/3] Add configuration option for PMDK

---
 configure                  | 92 ++++++++++++++++++++++++++++++++++++++++++++++
 configure.in               | 16 ++++++++
 src/include/pg_config.h.in |  6 +++
 3 files changed, 114 insertions(+)

diff --git a/configure b/configure
index dce6d98cf6..831977362f 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libpmem
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -863,6 +864,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libpmem
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1564,6 +1566,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libpmem          use PMEM support for WAL I/O
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -7325,6 +7328,33 @@ if test "$with_icu" = yes; then
 
 
 
+#
+# PMEM
+#
+
+
+
+# Check whether --with-libpmem was given.
+if test "${with_libpmem+set}" = set; then :
+  withval=$with_libpmem;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBPMEM 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libpmem=no
+
+fi
 
 
 
@@ -12343,6 +12373,57 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'pmem' is required for PMEM support" "$LINENO" 5
+fi
+
+fi
+
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -13169,6 +13250,17 @@ else
 fi
 
 
+fi
+
+if test "$with_libpmem" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for PMEM support" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$with_ldap" = yes ; then
diff --git a/configure.in b/configure.in
index e5123ac122..65ac2fbd25 100644
--- a/configure.in
+++ b/configure.in
@@ -949,6 +949,14 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 
 AC_SUBST(with_libxslt)
 
+#
+# PMEM
+#
+PGAC_ARG_BOOL(with, libpmem, no, [use PMEM support for WAL I/O],
+	      [AC_DEFINE([USE_LIBPMEM], 1, [Define to 1 to use PMEM support for WAL I/O. (--with-libpmem)])])
+
+AC_SUBST(with_libpmem)
+
 #
 # tzdata
 #
@@ -1230,6 +1238,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [], [AC_MSG_ERROR([library 'pmem' is required for PMEM support])])
+fi
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -1418,6 +1430,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for PMEM support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 6ac75cd02c..f11e5ef916 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -378,6 +378,9 @@
 /* Define to 1 if you have the `xslt' library (-lxslt). */
 #undef HAVE_LIBXSLT
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
@@ -915,6 +918,9 @@
 /* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
 #undef USE_LLVM
 
+/* Define to 1 to use PMEM support for WAL I/O. (--with-libpmem) */
+#undef USE_LIBPMEM
+
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
-- 
2.16.4

0003-Walreceiver-WAL-IO-using-PMDK-v2.patch (application/octet-stream)
From ff658e5fecf317888ede6f11827be50495ec4f00 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Mon, 10 Dec 2018 21:51:15 +0100
Subject: [PATCH 3/3] Walreceiver WAL IO using PMDK

---
 src/backend/replication/walreceiver.c | 66 +++++++++++++++++++++--------------
 1 file changed, 40 insertions(+), 26 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9643c2ed7b..e03b660109 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -61,6 +61,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -91,6 +92,7 @@ static int	recvFile = -1;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
 static uint32 recvOff = 0;
+static void	*mappedFileAddr = NULL;
 
 /*
  * Flags set by interrupt handlers of walreceiver for later service in the
@@ -599,12 +601,12 @@ WalReceiverMain(void)
 		 * End of WAL reached on the requested timeline. Close the last
 		 * segment, and await for new orders from the startup process.
 		 */
-		if (recvFile >= 0)
+		if (recvFile >= 0 || mappedFileAddr != NULL)
 		{
 			char		xlogfname[MAXFNAMELEN];
 
 			XLogWalRcvFlush(false);
-			if (close(recvFile) != 0)
+			if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -621,6 +623,7 @@ WalReceiverMain(void)
 				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
+		mappedFileAddr = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -931,7 +934,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if ((recvFile < 0 && mappedFileAddr == NULL) ||
+				!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
 			bool		use_existent;
 
@@ -939,7 +943,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
 			 */
-			if (recvFile >= 0)
+			if (recvFile >= 0 || mappedFileAddr != NULL)
 			{
 				char		xlogfname[MAXFNAMELEN];
 
@@ -950,7 +954,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -967,11 +971,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
+			mappedFileAddr = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true, &mappedFileAddr);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
 		}
@@ -987,30 +992,39 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* Need to seek in the file? */
 		if (recvOff != startoff)
 		{
-			if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
-				ereport(PANIC,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(recvFileTLI, recvSegNo),
-								startoff)));
+			if (!mappedFileAddr)
+				if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
+					ereport(PANIC,
+							(errcode_for_file_access(),
+							 errmsg("could not seek in log segment %s to offset %u: %m",
+								 XLogFileNameP(recvFileTLI, recvSegNo),
+								 startoff)));
 			recvOff = startoff;
 		}
 
-		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = write(recvFile, buf, segbytes);
-		if (byteswritten <= 0)
+		if (mappedFileAddr)
 		{
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							XLogFileNameP(recvFileTLI, recvSegNo),
-							recvOff, (unsigned long) segbytes)));
+			PmemFileWrite((char *)mappedFileAddr+startoff, buf, segbytes);
+			byteswritten = segbytes;
+		}
+		else
+		{
+			/* OK to write the logs */
+			errno = 0;
+
+			byteswritten = write(recvFile, buf, segbytes);
+			if (byteswritten <= 0)
+			{
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+							 "at offset %u, length %lu: %m",
+							 XLogFileNameP(recvFileTLI, recvSegNo),
+							 recvOff, (unsigned long) segbytes)));
+			}
 		}
 
 		/* Update state for write */
-- 
2.16.4

0002-Read-write-WAL-files-using-PMDK-v2.patch (application/octet-stream)
From 0661cd379372543bb286a758ca1eaecbcd424986 Mon Sep 17 00:00:00 2001
From: erthalion <9erthalion6@gmail.com>
Date: Mon, 10 Dec 2018 21:50:36 +0100
Subject: [PATCH 2/3] Read write WAL files using PMDK

---
 src/backend/access/transam/timeline.c         |   4 +-
 src/backend/access/transam/xlog.c             | 422 ++++++++++++++++++--------
 src/backend/storage/file/Makefile             |   2 +-
 src/backend/storage/file/fd.c                 | 132 +++++++-
 src/backend/storage/file/pmem.c               | 188 ++++++++++++
 src/backend/utils/misc/guc.c                  |   2 +-
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h                     |   8 +-
 src/include/storage/fd.h                      |  17 +-
 src/include/storage/pmem.h                    |  32 ++
 src/include/utils/guc.h                       |   1 +
 11 files changed, 674 insertions(+), 135 deletions(-)
 create mode 100644 src/backend/storage/file/pmem.c
 create mode 100644 src/include/storage/pmem.h

diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index 70eec5676e..765b72cade 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -426,7 +426,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -505,7 +505,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14ed97..81087b2595 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -58,6 +58,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -141,6 +142,9 @@ const struct config_enum_entry sync_method_options[] = {
 #endif
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
+#endif
+#ifdef USE_LIBPMEM
+	{"pmem_drain", SYNC_METHOD_PMEM_DRAIN, false},
 #endif
 	{NULL, 0, false}
 };
@@ -779,6 +783,7 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
 static uint32 openLogOff = 0;
+static void	*mappedLogFileAddr = NULL;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -793,6 +798,7 @@ static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static XLogSource readSource = 0;	/* XLOG_FROM_* code */
+static void	*mappedReadFileAddr = NULL;
 
 /*
  * Keeps track of which source we're currently reading from. This is
@@ -878,13 +884,15 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
+static int do_XLogFileOpen(char *pathname, int flags, void **addr);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock);
+					   bool use_lock, bool fsync_file);
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk);
-static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
+			 int source, bool notfoundOk, void **addr);
+static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source,
+		void **addr);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
 			 TimeLineID *readTLI);
@@ -2361,6 +2369,15 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }
 
+static int
+do_XLogFileOpen(char *pathname, int flags,  void **addr)
+{
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		return PmemFileOpen(pathname, flags, wal_segment_size, addr);
+	else
+		return BasicOpenFile(pathname, flags);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2440,23 +2457,25 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent,
+					true, &mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile < 0 && mappedLogFileAddr == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
+			openLogFile = XLogFileOpen(openLogSegNo,
+					&mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
@@ -2497,6 +2516,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+				if (mappedLogFileAddr != NULL)
+				{
+					pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+					PmemFileWrite((char *)mappedLogFileAddr+openLogOff, from, nleft);
+					pgstat_report_wait_end();
+					break;
+				}
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
@@ -2594,15 +2620,16 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if ((openLogFile >= 0 || mappedLogFileAddr != NULL) &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile < 0 && mappedLogFileAddr == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
+				openLogFile = XLogFileOpen(openLogSegNo,
+						&mappedLogFileAddr);
 				openLogOff = 0;
 			}
 
@@ -3027,7 +3054,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3208,7 +3235,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+		void **addr)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3217,6 +3245,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 	int			fd;
 	int			nbytes;
+	void	*tmpaddr = NULL;
+	bool	fsync_file = true;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3225,16 +3255,20 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		fd = do_XLogFileOpen(path,
+				O_RDWR | PG_BINARY | get_sync_bit(sync_method),
+				&tmpaddr);
+		if (fd < 0 && tmpaddr == NULL)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
 						(errcode_for_file_access(),
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
-		else
+		else {
+			*addr = tmpaddr;
 			return fd;
+		}
 	}
 
 	/*
@@ -3250,8 +3284,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	fd = do_XLogFileOpen(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+			&tmpaddr);
+	if (fd < 0 && tmpaddr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3268,35 +3303,45 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 	{
-		errno = 0;
-		pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-		if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+		if (tmpaddr != NULL)
 		{
-			int			save_errno = errno;
+			fsync_file = false;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			PmemFileWrite((char *)tmpaddr + nbytes, zbuffer.data, XLOG_BLCKSZ);
+		}
+		else
+		{
+			errno = 0;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+			{
+				int			save_errno = errno;
 
-			/*
-			 * If we fail to make the file, delete it to release disk space
-			 */
-			unlink(tmppath);
+				/*
+				 * If we fail to make the file, delete it to release disk space
+				 */
+				unlink(tmppath);
 
-			close(fd);
+				close(fd);
 
-			/* if write didn't set errno, assume problem is no disk space */
-			errno = save_errno ? save_errno : ENOSPC;
+				/* if write didn't set errno, assume problem is no disk space */
+				errno = save_errno ? save_errno : ENOSPC;
 
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write to file \"%s\": %m", tmppath)));
+
+			}
 		}
 		pgstat_report_wait_end();
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, tmpaddr) != 0)
 	{
 		int			save_errno = errno;
 
-		close(fd);
+		do_XLogFileClose(fd, tmpaddr);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3304,7 +3349,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd))
+	if (do_XLogFileClose(fd, tmpaddr))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3331,7 +3376,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 								*use_existent, max_segno,
-								use_lock))
+								use_lock,
+								fsync_file))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3345,8 +3391,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+			O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+
+	if (fd < 0 && *addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3381,13 +3429,21 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	int			srcfd;
 	int			fd;
 	int			nbytes;
+	void		*src_addr = NULL, *dst_addr = NULL;
+	bool		fsync_file = true;
 
 	/*
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno, wal_segment_size);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		srcfd = MapTransientFile(path, O_RDONLY | PG_BINARY,
+				wal_segment_size, &src_addr);
+
+	if (src_addr == NULL)
+		srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+
+	if (srcfd < 0 && src_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3400,15 +3456,32 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	if (src_addr != NULL && sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fd = MapTransientFile(tmppath,
+				O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+				wal_segment_size, &dst_addr);
+	else
+		fd = OpenTransientFile(tmppath,
+				O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	if (fd < 0 && dst_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m", tmppath)));
+				 errmsg("could not create file \"%s\": %m",
+					 tmppath)));
 
 	/*
 	 * Do the data copying.
 	 */
+	if (src_addr && dst_addr) {
+		pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+		PmemFileWrite(dst_addr, src_addr, wal_segment_size);
+		pgstat_report_wait_end();
+		fsync_file = false;
+
+		goto done_copy;
+	}
+
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += sizeof(buffer))
 	{
 		int			nread;
@@ -3465,24 +3538,36 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 		pgstat_report_wait_end();
 	}
 
+done_copy:
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, dst_addr) != 0)
 		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd))
+	if (dst_addr)
+	{
+		if (UnmapTransientFile(dst_addr, wal_segment_size))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not unmap file \"%s\": %m",
+						 tmppath)));
+	}
+	else if (CloseTransientFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
-	CloseTransientFile(srcfd);
+	if (src_addr)
+		UnmapTransientFile(src_addr, wal_segment_size);
+	else
+		CloseTransientFile(srcfd);
 
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false, fsync_file))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3517,7 +3602,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool use_lock, bool fsync_file)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3556,7 +3641,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	if (durable_link_or_rename(tmppath, path, LOG) != 0)
+	if (durable_link_or_rename(tmppath, path, LOG, fsync_file) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3574,15 +3659,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(XLogSegNo segno)
+XLogFileOpen(XLogSegNo segno, void **addr)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+			O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3598,7 +3684,7 @@ XLogFileOpen(XLogSegNo segno)
  */
 static int
 XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk)
+			 int source, bool notfoundOk, void **addr)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
@@ -3647,8 +3733,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 	}
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-	if (fd >= 0)
+	fd = do_XLogFileOpen(path, O_RDONLY | PG_BINARY, addr);
+	if (fd >= 0 || *addr != NULL)
 	{
 		/* Success! */
 		curFileTLI = tli;
@@ -3680,7 +3766,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLEs.
  */
 static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source, void **addr)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -3720,8 +3806,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_ARCHIVE, true);
-			if (fd != -1)
+					XLOG_FROM_ARCHIVE, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
@@ -3733,8 +3819,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_PG_WAL, true);
-			if (fd != -1)
+					XLOG_FROM_PG_WAL, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3752,13 +3838,22 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 	return -1;
 }
 
+int
+do_XLogFileClose(int fd, void *addr)
+{
+	if (!addr)
+		return close(fd);
+
+	return PmemFileClose(addr, wal_segment_size);
+}
+
 /*
  * Close the current logfile segment for writing.
  */
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile >= 0 || mappedLogFileAddr != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3767,15 +3862,16 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && openLogFile >= 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						XLogFileNameP(ThisTimeLineID, openLogSegNo))));
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 }
 
@@ -3795,6 +3891,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
+	void		*laddr = NULL;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3803,8 +3900,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		lf = XLogFileInit(_logSegNo, &use_existent, true, &laddr);
+		do_XLogFileClose(lf, laddr);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -4053,6 +4150,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	struct stat statbuf;
 	XLogSegNo	endlogSegNo;
 	XLogSegNo	recycleSegNo;
+	bool		fsync_file = true;
 
 	/*
 	 * Initialize info about where to try to recycle to.
@@ -4065,6 +4163,9 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 
 	snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname);
 
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fsync_file = false;
+
 	/*
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
@@ -4073,7 +4174,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-							   true, recycleSegNo, true))
+			true, recycleSegNo, true, fsync_file))
 	{
 		ereport(DEBUG2,
 				(errmsg("recycled write-ahead log file \"%s\"",
@@ -4240,9 +4341,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			if (readFile >= 0)
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
 			{
-				close(readFile);
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+				mappedReadFileAddr = NULL;
 				readFile = -1;
 			}
 
@@ -4781,7 +4883,8 @@ UpdateControlFile(void)
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
 	if (write(fd, ControlFile, sizeof(ControlFileData)) != sizeof(ControlFileData))
 	{
-		/* if write didn't set errno, assume problem is no disk space */
+		/* if write didn't set errno, assume problem is no disk
+		 * space */
 		if (errno == 0)
 			errno = ENOSPC;
 		ereport(PANIC,
@@ -5213,34 +5316,44 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false, &mappedLogFileAddr);
 
 	/* Write the first page with the initial record */
 	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+
+	if (mappedLogFileAddr != NULL)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		PmemFileWrite(mappedLogFileAddr, page, XLOG_BLCKSZ);
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
 	}
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	if (xlog_fsync(openLogFile, (void *)mappedLogFileAddr) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 
 	/* Now create pg_control */
@@ -5474,9 +5587,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * If the ending log segment is still open, close it (to avoid problems on
 	 * Windows with trying to rename or delete an open file).
 	 */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 
@@ -5515,10 +5629,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 */
 		bool		use_existent = true;
 		int			fd;
+		void		*tmpaddr = NULL;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent, true, &tmpaddr);
 
-		if (close(fd))
+		if (do_XLogFileClose(fd, tmpaddr))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not close file \"%s\": %m",
@@ -7715,9 +7830,10 @@ StartupXLOG(void)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 	XLogReaderFree(xlogreader);
@@ -10056,6 +10172,9 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+#endif
 			return 0;
 #ifdef OPEN_SYNC_FLAG
 		case SYNC_METHOD_OPEN:
@@ -10073,7 +10192,36 @@ get_sync_bit(int method)
 }
 
 /*
- * GUC support
+ * GUC check_hook for xlog_sync_method
+ */
+bool
+check_xlog_sync_method(int *newval, void **extra, GucSource source)
+{
+	bool ret;
+	char tmppath[MAXPGPATH] = {0};
+	int val = newval ? *newval : sync_method;
+
+	if (val != SYNC_METHOD_PMEM_DRAIN)
+		return true;
+
+	snprintf(tmppath, MAXPGPATH, "%s/" XLOGDIR "/pmem.tmp", DataDir);
+
+	ret = CheckPmem(tmppath);
+
+	if (!ret)
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for parameter \"wal_sync_method\": \"pmem_drain\"");
+		GUC_check_errdetail("%s isn't stored on persistent memory (pmem_is_pmem() returned false).",
+				XLOGDIR);
+		GUC_check_errhint("See the ENVIRONMENT VARIABLES section of the libpmem man page.");
+	}
+
+	return ret;
+}
+
+/*
+ * GUC assign_hook for xlog_sync_method
  */
 void
 assign_xlog_sync_method(int new_sync_method, void *extra)
@@ -10086,10 +10234,10 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			if (xlog_fsync(openLogFile, (void *)mappedLogFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m",
@@ -10138,6 +10286,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 						 errmsg("could not fdatasync file \"%s\": %m",
 								XLogFileNameP(ThisTimeLineID, segno))));
 			break;
+#endif
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+			PmemFileSync();
+			break;
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
@@ -10150,6 +10303,17 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	pgstat_report_wait_end();
 }
 
+int
+xlog_fsync(int fd, void *addr)
+{
+	if (!addr)
+		return pg_fsync(fd);
+
+	PmemFileSync();
+	return 0;
+}
+
+
 /*
  * Return the filename of given log segment, as a palloc'd string.
  */
@@ -11563,7 +11727,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || mappedReadFileAddr != NULL) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -11580,7 +11744,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 		readSource = 0;
 	}
@@ -11589,7 +11754,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readFile < 0 && mappedReadFileAddr == NULL) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
@@ -11598,8 +11763,9 @@ retry:
 										 private->fetching_ckpt,
 										 targetRecPtr))
 		{
-			if (readFile >= 0)
-				close(readFile);
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+			mappedReadFileAddr = NULL;
 			readFile = -1;
 			readLen = 0;
 			readSource = 0;
@@ -11612,7 +11778,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || mappedReadFileAddr != NULL);
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -11634,29 +11800,36 @@ retry:
 	/* Read the requested page */
 	readOff = targetPageOff;
 
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (mappedReadFileAddr) {
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		PmemFileRead((char *) mappedReadFileAddr + readOff, readBuf, XLOG_BLCKSZ);
+	}
+	else
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
 	}
 	pgstat_report_wait_end();
 
@@ -11704,8 +11877,9 @@ retry:
 next_record_is_invalid:
 	lastSourceFailed = true;
 
-	if (readFile >= 0)
-		close(readFile);
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+	mappedReadFileAddr = NULL;
 	readFile = -1;
 	readLen = 0;
 	readSource = 0;
@@ -11962,9 +12136,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 				/* Close any old file we might have open. */
-				if (readFile >= 0)
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 				{
-					close(readFile);
+					do_XLogFileClose(readFile,
+							mappedReadFileAddr);
+					mappedReadFileAddr = NULL;
 					readFile = -1;
 				}
 				/* Reset curFileTLI if random fetch. */
@@ -11977,8 +12153,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 */
 				readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
-											  currentSource);
-				if (readFile >= 0)
+											  currentSource, &mappedReadFileAddr);
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 					return true;	/* success! */
 
 				/*
@@ -12042,14 +12218,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (readFile < 0 && mappedReadFileAddr == NULL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
 													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+													XLOG_FROM_STREAM, false, &mappedReadFileAddr);
+							Assert(readFile >= 0 || mappedReadFileAddr != NULL);
 						}
 						else
 						{
diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index ca6a0e4f7d..9271153553 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/file
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o
+OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o pmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 9e596e7868..01b269d9a5 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -88,6 +88,7 @@
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -223,6 +224,9 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
+#ifdef USE_LIBPMEM
+	AllocateDescMap,
+#endif
 	AllocateDescPipe,
 	AllocateDescDir,
 	AllocateDescRawFD
@@ -237,6 +241,10 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+#ifdef USE_LIBPMEM
+		size_t	fsize;
+		void	   *addr;
+#endif
 	}			desc;
 } AllocateDesc;
 
@@ -705,14 +713,16 @@ durable_unlink(const char *fname, int elevel)
  * valid upon return.
  */
 int
-durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
+durable_link_or_rename(const char *oldfile, const char *newfile, int elevel,
+		bool fsync_file)
 {
 	/*
 	 * Ensure that, if we crash directly after the rename/link, a file with
 	 * valid contents is moved into place.
 	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+			return -1;
 
 #if HAVE_WORKING_LINK
 	if (link(oldfile, newfile) < 0)
@@ -740,8 +750,9 @@ durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
 	 * Make change persistent in case of an OS crash, both the new entry and
 	 * its parent directory need to be flushed.
 	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+			return -1;
 
 	/* Same for parent directory */
 	if (fsync_parent_path(newfile, elevel) != 0)
@@ -1556,6 +1567,76 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	return file;
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Mmap a file with MapTransientFilePerm() and pass default file mode for
+ * the fileMode parameter.
+ */
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return MapTransientFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT,
+			fsize, addr);
+}
+
+/*
+ * Like AllocateFile, but maps the file and returns a pointer to the mapped
+ * area through *addr, like mmap(2); returns -1 on failure
+ */
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+		size_t fsize, void **addr)
+{
+	int			fd;
+
+	DO_DB(elog(LOG, "MapTransientFilePerm: Allocated %d (%s)",
+			   numAllocatedDescs, fileName));
+
+	/* Can we allocate another non-virtual FD? */
+	if (!reserveAllocatedDesc())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
+						maxAllocatedDescs, fileName)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	if (addr != NULL)
+	{
+		void *ret_addr = NULL;
+		fd = PmemFileOpenPerm(fileName, fileFlags, fileMode, fsize, &ret_addr);
+		if (ret_addr != NULL)
+		{
+			AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+			*addr = ret_addr;
+
+			desc->kind = AllocateDescMap;
+			desc->desc.addr = ret_addr;
+			desc->desc.fsize = fsize;
+			desc->create_subid = GetCurrentSubTransactionId();
+			numAllocatedDescs++;
+
+			return fd;
+		}
+	}
+
+	return -1;					/* failure */
+}
+#else
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return -1;
+}
+
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+		size_t fsize, void **addr)
+{
+	return -1;
+}
+#endif
 
 /*
  * Create a new file.  The directory containing it must already exist.  Files
@@ -2361,6 +2442,11 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescRawFD:
 			result = close(desc->desc.fd);
 			break;
+#ifdef USE_LIBPMEM
+		case AllocateDescMap:
+			result = PmemFileClose(desc->desc.addr, desc->desc.fsize);
+			break;
+#endif
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -2402,6 +2488,42 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Unmap a file returned by MapTransientFile.
+ *
+ * Note we do not check unmap's return value --- it is up to the caller
+ * to handle unmap errors.
+ */
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "UnmapTransientFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove the mapping from the list of allocated descs, if it's present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescMap && desc->desc.addr == addr)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us an address not in allocatedDescs */
+	elog(WARNING, "address passed to UnmapTransientFile was not obtained from MapTransientFile");
+
+	return PmemFileClose(addr, fsize);
+}
+#else
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	return -1;
+}
+#endif
+
 /*
  * Close a file returned by OpenTransientFile.
  *
diff --git a/src/backend/storage/file/pmem.c b/src/backend/storage/file/pmem.c
new file mode 100644
index 0000000000..85fed32b52
--- /dev/null
+++ b/src/backend/storage/file/pmem.c
@@ -0,0 +1,188 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.c
+ *	  Persistent memory (PMEM) file access code.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/file/pmem.c
+ *
+ * NOTES:
+ *
+ * This code manages a memory-mapped file on a DAX-mounted filesystem on a
+ * persistent memory device, using the Persistent Memory Development Kit
+ * (http://pmem.io/pmdk/).
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/pmem.h"
+#include "storage/fd.h"
+
+#ifdef USE_LIBPMEM
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libpmem.h>
+#include <sys/mman.h>
+#include <string.h>
+
+#define PmemFileSize 32
+
+/*
+ * Returns true only if the file is stored on persistent memory.
+ */
+bool
+CheckPmem(const char *path)
+{
+	int    is_pmem = 0; /* false */
+	size_t mapped_len = 0;
+	bool   ret = true;
+	void   *tmpaddr;
+
+	/*
+	 * is_pmem stays 0 if the file at path isn't stored on
+	 * persistent memory.
+	 */
+	tmpaddr = pmem_map_file(path, PmemFileSize, PMEM_FILE_CREATE,
+			PG_FILE_MODE_DEFAULT, &mapped_len, &is_pmem);
+
+	if (tmpaddr)
+	{
+		pmem_unmap(tmpaddr, mapped_len);
+		unlink(path);
+	}
+
+	if (is_pmem)
+		elog(LOG, "%s is stored on persistent memory.", path);
+	else
+		ret = false;
+
+	return ret;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return PmemFileOpenPerm(pathname, flags, PG_FILE_MODE_DEFAULT, fsize, addr);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+		void **addr)
+{
+	int mapped_flag = 0;
+	size_t mapped_len = 0, size = 0;
+	void *ret_addr;
+
+	if (addr == NULL)
+		return BasicOpenFile(pathname, flags);
+
+	/* non-zero 'len' not allowed without PMEM_FILE_CREATE */
+	if (flags & O_CREAT)
+	{
+		mapped_flag = PMEM_FILE_CREATE;
+		size = fsize;
+	}
+
+	if (flags & O_EXCL)
+		mapped_flag |= PMEM_FILE_EXCL;
+
+	ret_addr = pmem_map_file(pathname, size, mapped_flag, mode, &mapped_len,
+			NULL);
+
+	if (fsize != mapped_len)
+	{
+		if (ret_addr != NULL)
+			pmem_unmap(ret_addr, mapped_len);
+
+		return -1;
+	}
+
+	if (mapped_flag & PMEM_FILE_CREATE)
+		if (msync(ret_addr, mapped_len, MS_SYNC))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not msync log file %s: %m", pathname)));
+
+	*addr = ret_addr;
+
+	return NO_FD_FOR_MAPPED_FILE;
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	pmem_memcpy_nodrain((void *)dest, src, len);
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	memcpy(buf, (void *)map_addr, len);
+}
+
+void
+PmemFileSync(void)
+{
+	pmem_drain();
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	return pmem_unmap((void *)addr, fsize);
+}
+
+
+#else
+bool
+CheckPmem(const char *path)
+{
+	return true;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return BasicOpenFile(pathname, flags);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+		void **addr)
+{
+	return BasicOpenFilePerm(pathname, flags, mode);
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileSync(void)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+	return -1;
+}
+#endif
+
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 03594e77fe..033edbfa79 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4311,7 +4311,7 @@ static struct config_enum ConfigureNamesEnum[] =
 		},
 		&sync_method,
 		DEFAULT_SYNC_METHOD, sync_method_options,
-		NULL, assign_xlog_sync_method, NULL
+		check_xlog_sync_method, assign_xlog_sync_method, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 1fa02d2c93..b2ebe56bb5 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -196,6 +196,7 @@
 					#   fsync
 					#   fsync_writethrough
 					#   open_sync
+					#   pmem_drain
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f3a7ba4d42..0f7786c64a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -27,6 +27,7 @@
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
+#define SYNC_METHOD_PMEM_DRAIN	5		/* for Persistent Memory Development Kit */
 extern int	sync_method;
 
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
@@ -259,8 +260,10 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern int	XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+		void **addr);
+extern int	XLogFileOpen(XLogSegNo segno, void **addr);
+extern int	do_XLogFileClose(int fd, void *addr);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -272,6 +275,7 @@ extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern int	xlog_fsync(int fd, void *addr);
 
 extern bool RecoveryInProgress(void);
 extern bool HotStandbyActive(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index cb882fb74e..0bf50aa3d0 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,13 @@
 typedef int File;
 
 
+/*
+ * Default mode for created files, unless something else is specified using
+ * the *Perm() function variants.
+ */
+#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
@@ -104,6 +111,13 @@ extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern int	CloseTransientFile(int fd);
 
+/* Operations to allow use of a memory-mapped file */
+extern int	MapTransientFile(const char *fileName, int fileFlags, size_t fsize,
+		void **addr);
+extern int	MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+		size_t fsize, void **addr);
+extern int	UnmapTransientFile(void *addr, size_t fsize);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
 extern int	BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -133,7 +147,8 @@ extern void pg_flush_data(int fd, off_t offset, off_t amount);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int	durable_link_or_rename(const char *oldfile, const char *newfile,
+		int loglevel, bool fsync_fname);
 extern void SyncDataDirectory(void);
 extern int data_sync_elevel(int elevel);
 
diff --git a/src/include/storage/pmem.h b/src/include/storage/pmem.h
new file mode 100644
index 0000000000..823889ab38
--- /dev/null
+++ b/src/include/storage/pmem.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.h
+ *		Virtual file descriptor definitions for persistent memory.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/pmem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PMEM_H
+#define PMEM_H
+
+#include "postgres.h"
+
+#define NO_FD_FOR_MAPPED_FILE -2
+
+extern bool	CheckPmem(const char *path);
+extern int	PmemFileOpen(const char *pathname, int flags, size_t fsize,
+		void **addr);
+extern int	PmemFileOpenPerm(const char *pathname, int flags, int mode,
+		size_t fsize, void **addr);
+extern void	PmemFileWrite(void *dest, void *src, size_t len);
+extern void	PmemFileRead(void *map_addr, void *buf, size_t len);
+extern void	PmemFileSync(void);
+extern int	PmemFileClose(void *addr, size_t fsize);
+
+#endif /* PMEM_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 64457c792a..685d5aece5 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -433,6 +433,7 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xlog_sync_method(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.16.4

#30Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Dmitry Dolgov (#29)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On 10/12/2018 23:37, Dmitry Dolgov wrote:

On Thu, Nov 29, 2018 at 6:48 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:

On Tue, Oct 2, 2018 at 4:53 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Aug 06, 2018 at 06:00:54PM +0900, Yoshimi Ichiyanagi wrote:

libpmem's pmem_map_file() supports 2M/1G (huge page size) alignment,
which can reduce the number of page faults.
In addition, libpmem's pmem_memcpy_nodrain() copies data using
single instruction, multiple data (SIMD) instructions
and non-temporal (NT) store instructions (MOVNT).
As a result, using these APIs is faster than using plain mmap()/memcpy().
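
To put a rough number on the page-fault argument, here is a small sketch. It assumes the default 16 MiB WAL segment size and one minor fault per page on first touch; both are assumptions of this example, not measurements from the patch, and `pages_needed` is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of pages (and so, roughly, first-touch page faults) needed to
 * map len bytes with pages of page_size bytes, rounding up.
 */
size_t
pages_needed(size_t len, size_t page_size)
{
	return (len + page_size - 1) / page_size;
}
```

With a 16 MiB segment, `pages_needed(16 << 20, 4096)` gives 4096 pages, while 2 MiB pages need only 8, which is where the reduction in faults would come from.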

Please see the PGCon2018 presentation[1] for the details.

[1] https://www.pgcon.org/2018/schedule/attachments/507_PGCon2018_Introducing_PMDK_into_PostgreSQL.pdf

So you say that this represents a 3% gain based on the presentation?
That may be interesting to dig into. Could you provide fresher
performance numbers? I am moving this patch to the next CF 2018-10 for
now, waiting for input from the author.

Unfortunately, the patch has some conflicts now, so not only fresher
performance numbers but also a rebased version is probably necessary.

I believe the idea behind this patch is quite important (thanks to CMU DG for
inspiring lectures), so I decided to put in some effort and rebase it to prevent
it from rotting. At the same time, I have a vague impression that the patch itself
suggests quite a narrow way of using PMDK.

Thanks.

To re-iterate what I said earlier in this thread, I think the next step
here is to write a patch that modifies xlog.c to use plain old
mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.
Let's see what the performance of that is, with or without NVM hardware.
I think that might actually make the code simpler. There's a bunch of
really hairy code around locking the WAL buffers, which could be made
simpler if each backend memory-mapped the WAL segment files independently.
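
A minimal sketch of the plain mmap()/msync() write path described above, outside of PostgreSQL. `mapped_write` is a hypothetical helper, not anything in xlog.c; it maps the whole segment, copies the data in, and calls msync() before unmapping. A real implementation would keep the mapping open across writes and sync only the dirty range.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write len bytes from src at byte offset off of the
 * file at path, through a shared mapping of seg_size bytes.  The file is
 * created and sized to seg_size if needed.  Returns 0 on success, -1 on
 * failure.
 */
int
mapped_write(const char *path, size_t seg_size, size_t off,
			 const void *src, size_t len)
{
	int			fd;
	char	   *base;
	int			rc;

	/* Refuse writes that would run past the end of the segment. */
	if (off > seg_size || len > seg_size - off)
		return -1;

	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	if (ftruncate(fd, (off_t) seg_size) != 0)
	{
		close(fd);
		return -1;
	}

	base = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close() */
	if (base == MAP_FAILED)
		return -1;

	/* The "write" is an ordinary store into the mapping ... */
	memcpy(base + off, src, len);

	/* ... and durability comes from msync() rather than fsync(). */
	rc = msync(base, seg_size, MS_SYNC);
	if (munmap(base, seg_size) != 0)
		rc = -1;
	return rc;
}
```

For example, `mapped_write("seg", 8192, 4096, "WAL!", 4)` would place four bytes at offset 4096 of an 8 kB file named `seg`.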

One thing to watch out for, is that if you read() a file, and there's an
I/O error, you have a chance to ereport() it. If you try to read from a
memory-mapped file, and there's an I/O error, the process is killed with
SIGBUS. So I think we have to be careful with using memory-mapped I/O
for reading files. But for writing WAL files, it seems like a good fit.
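
To illustrate the SIGBUS concern, here is a hedged sketch. It does not provoke a real I/O error; instead it truncates a mapped file so that touching the page raises SIGBUS, which is an assumption standing in for a failing device. `guarded_copy` and `demo_sigbus` are hypothetical names, and the sigsetjmp()-based recovery is only one possible approach; it is not thread-safe and not something the patch does.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <setjmp.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

static sigjmp_buf bus_jmp;

static void
on_sigbus(int signo)
{
	(void) signo;
	siglongjmp(bus_jmp, 1);		/* unwind out of the faulting copy */
}

/*
 * Copy len bytes out of a mapping, turning a SIGBUS raised during the
 * copy into a -1 return instead of process death.
 */
int
guarded_copy(void *buf, const void *map_addr, size_t len)
{
	struct sigaction sa, old;
	int			rc = 0;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = on_sigbus;
	sigemptyset(&sa.sa_mask);
	if (sigaction(SIGBUS, &sa, &old) != 0)
		return -1;

	if (sigsetjmp(bus_jmp, 1) == 0)
		memcpy(buf, map_addr, len);
	else
		rc = -1;				/* arrived here via siglongjmp() */

	sigaction(SIGBUS, &old, NULL);
	return rc;
}

/* Returns 0 if the copy works before truncation and fails after it. */
int
demo_sigbus(void)
{
	char		tmpl[] = "/tmp/sigbus_demo_XXXXXX";
	char		buf[16];
	size_t		pagesz = (size_t) sysconf(_SC_PAGESIZE);
	int			fd = mkstemp(tmpl);
	void	   *map;
	int			before, after = 0;

	if (fd < 0)
		return -2;
	unlink(tmpl);
	if (ftruncate(fd, (off_t) pagesz) != 0)
	{
		close(fd);
		return -2;
	}
	map = mmap(NULL, pagesz, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
	{
		close(fd);
		return -2;
	}

	before = guarded_copy(buf, map, sizeof(buf));
	if (ftruncate(fd, 0) == 0)	/* the page is now beyond end of file */
		after = guarded_copy(buf, map, sizeof(buf));

	munmap(map, pagesz);
	close(fd);
	return (before == 0 && after == -1) ? 0 : -1;
}
```

With read(), the same failure would surface as a return value and errno that can be passed to ereport(); with a mapping, some handler along these lines would be needed just to get that far.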

Once we have a reliable mmap()/msync() implementation running, it should
be straightforward to change it to use MAP_SYNC and the special CPU
instructions for the flushing.
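
A sketch of what that last step might look like at the mmap() level, assuming Linux and a glibc that defines MAP_SYNC and MAP_SHARED_VALIDATE. `map_wal_segment` is a hypothetical helper: it tries a MAP_SYNC mapping first and falls back to plain MAP_SHARED when the file system or kernel does not support it, telling the caller which one it got.

```c
#define _GNU_SOURCE				/* for MAP_SYNC / MAP_SHARED_VALIDATE on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdbool.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map len bytes of fd read/write.  On return, *is_sync is true if the
 * mapping was created with MAP_SYNC (CPU cache flushes should then be
 * enough for durability) and false if it is a plain MAP_SHARED mapping
 * (msync() is still required).  Returns NULL on failure.
 */
void *
map_wal_segment(int fd, size_t len, bool *is_sync)
{
	void	   *p;

#if defined(MAP_SYNC) && defined(MAP_SHARED_VALIDATE)
	/* MAP_SHARED_VALIDATE makes the kernel reject flags it cannot honor. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED)
	{
		*is_sync = true;
		return p;
	}
#endif

	/* Not a DAX file system (or an older kernel): plain shared mapping. */
	*is_sync = false;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return (p == MAP_FAILED) ? NULL : p;
}
```

The flush path would then branch on `is_sync`: msync() for the fallback case, and cache-flush instructions (or a library such as libpmem) for the MAP_SYNC case.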

- Heikki

#31Andres Freund
andres@anarazel.de
In reply to: Heikki Linnakangas (#30)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Hi,

On 2019-01-23 18:45:42 +0200, Heikki Linnakangas wrote:

To re-iterate what I said earlier in this thread, I think the next step here
is to write a patch that modifies xlog.c to use plain old mmap()/msync() to
memory-map the WAL files, to replace the WAL buffers. Let's see what the
performance of that is, with or without NVM hardware. I think that might
actually make the code simpler. There's a bunch of really hairy code around
locking the WAL buffers, which could be made simpler if each backend
memory-mapped the WAL segment files independently.

One thing to watch out for, is that if you read() a file, and there's an I/O
error, you have a chance to ereport() it. If you try to read from a
memory-mapped file, and there's an I/O error, the process is killed with
SIGBUS. So I think we have to be careful with using memory-mapped I/O for
reading files. But for writing WAL files, it seems like a good fit.

Once we have a reliable mmap()/msync() implementation running, it should be
straightforward to change it to use MAP_SYNC and the special CPU
instructions for the flushing.

FWIW, I don't think we should go there as the sole implementation. I'm
fairly convinced that we're going to need to go to direct-IO in more
cases here, and that'll not work well with mmap. I think this'd be a
worthwhile experiment, but I'm doubtful it'd end up simplifying our
code.

Greetings,

Andres Freund

#32Takashi Menjo
menjo.takashi@lab.ntt.co.jp
In reply to: Andres Freund (#31)
3 attachment(s)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Hello,

On behalf of Yoshimi, I rebased the patchset onto the latest master
(e3565fd6). Please see the attachment. It also includes an additional
bug fix (in patch 0002) for the temporary filename.

Note that PMDK 1.4.2+ supports the MAP_SYNC and MAP_SHARED_VALIDATE flags,
so please use a recent version of PMDK when you test. The latest version
is 1.5.

Heikki Linnakangas wrote:

To re-iterate what I said earlier in this thread, I think the next step
here is to write a patch that modifies xlog.c to use plain old
mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.

Sorry, but my new patchset still uses PMDK, because PMDK is supported on
Linux _and Windows_, and I think someone may want to test this patchset
on Windows...

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>

Attachments:

0001-Add-configure-option-for-PMDK-v2.patch (application/octet-stream)
diff --git a/configure b/configure
index ddb3c8b1ba..a23d13d602 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libpmem
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -861,6 +862,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libpmem
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1566,6 +1568,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libpmem          use PMEM support for WAL I/O
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -8241,6 +8244,33 @@ fi
 
 
 
+#
+# PMEM
+#
+
+
+
+# Check whether --with-libpmem was given.
+if test "${with_libpmem+set}" = set; then :
+  withval=$with_libpmem;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBPMEM 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libpmem=no
+
+fi
 
 
 
@@ -12322,6 +12352,57 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'pmem' is required for PMEM support" "$LINENO" 5
+fi
+
+fi
+
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -13148,6 +13229,17 @@ else
 fi
 
 
+fi
+
+if test "$with_libpmem" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for PMEM support" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$with_ldap" = yes ; then
diff --git a/configure.in b/configure.in
index 3d8888805c..91ef21a1cc 100644
--- a/configure.in
+++ b/configure.in
@@ -943,6 +943,14 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 
 AC_SUBST(with_libxslt)
 
+#
+# PMEM
+#
+PGAC_ARG_BOOL(with, libpmem, no, [use PMEM support for WAL I/O],
+	      [AC_DEFINE([USE_LIBPMEM], 1, [Define to 1 to use PMEM support for WAL I/O. (--with-libpmem)])])
+
+AC_SUBST(with_libpmem)
+
 #
 # tzdata
 #
@@ -1224,6 +1232,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [], [AC_MSG_ERROR([library 'pmem' is required for PMEM support])])
+fi
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -1412,6 +1424,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for PMEM support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 82547f321f..fde27aaa70 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -374,6 +374,9 @@
 /* Define to 1 if you have the `xslt' library (-lxslt). */
 #undef HAVE_LIBXSLT
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
@@ -917,6 +920,9 @@
 /* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
 #undef USE_LLVM
 
+/* Define to 1 to use PMEM support for WAL I/O. (--with-libpmem) */
+#undef USE_LIBPMEM
+
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
0002-Read-write-WAL-files-using-PMDK-v2.patch (application/octet-stream)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..eef94100bc 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -426,7 +426,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -505,7 +505,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d804f0..77d8dd2aeb 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -58,6 +58,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -140,6 +141,9 @@ const struct config_enum_entry sync_method_options[] = {
 #endif
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
+#endif
+#ifdef USE_LIBPMEM
+	{"pmem_drain", SYNC_METHOD_PMEM_DRAIN, false},
 #endif
 	{NULL, 0, false}
 };
@@ -778,6 +782,7 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
 static uint32 openLogOff = 0;
+static void *mappedLogFileAddr = NULL;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -792,6 +797,7 @@ static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static XLogSource readSource = 0;	/* XLOG_FROM_* code */
+static void *mappedReadFileAddr = NULL;
 
 /*
  * Keeps track of which source we're currently reading from. This is
@@ -877,13 +883,15 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
+static int	do_XLogFileOpen(char *pathname, int flags, void **addr);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock);
+					   bool use_lock, bool fsync_file);
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk);
-static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
+			 int source, bool notfoundOk, void **addr);
+static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source,
+				   void **addr);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
 			 TimeLineID *readTLI);
@@ -2360,6 +2368,15 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }
 
+static int
+do_XLogFileOpen(char *pathname, int flags, void **addr)
+{
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		return PmemFileOpen(pathname, flags, wal_segment_size, addr);
+	else
+		return BasicOpenFile(pathname, flags);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2439,23 +2456,25 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent,
+									   true, &mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile < 0 && mappedLogFileAddr == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
+			openLogFile = XLogFileOpen(openLogSegNo,
+									   &mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
@@ -2496,6 +2515,13 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			do
 			{
 				errno = 0;
+				if (mappedLogFileAddr != NULL)
+				{
+					pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+					PmemFileWrite((char *) mappedLogFileAddr + openLogOff, from, nleft);
+					pgstat_report_wait_end();
+					break;
+				}
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
 				written = pg_pwrite(openLogFile, from, nleft, startoffset);
 				pgstat_report_wait_end();
@@ -2593,15 +2619,16 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if ((openLogFile >= 0 || mappedLogFileAddr != NULL) &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile < 0 && mappedLogFileAddr == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
+				openLogFile = XLogFileOpen(openLogSegNo,
+										   &mappedLogFileAddr);
 				openLogOff = 0;
 			}
 
@@ -3026,7 +3053,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3207,7 +3234,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+			 void **addr)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3216,6 +3244,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 	int			fd;
 	int			nbytes;
+	void	   *tmpaddr = NULL;
+	bool		fsync_file = true;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3224,8 +3254,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		fd = do_XLogFileOpen(path,
+							 O_RDWR | PG_BINARY | get_sync_bit(sync_method),
+							 &tmpaddr);
+		if (fd < 0 && tmpaddr == NULL)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
@@ -3233,7 +3265,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
 		else
+		{
+			*addr = tmpaddr;
 			return fd;
+		}
 	}
 
 	/*
@@ -3249,8 +3284,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	fd = do_XLogFileOpen(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						 &tmpaddr);
+	if (fd < 0 && tmpaddr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3267,35 +3303,49 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 	{
-		errno = 0;
-		pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-		if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+		if (tmpaddr != NULL)
 		{
-			int			save_errno = errno;
+			fsync_file = false;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			PmemFileWrite((char *) tmpaddr + nbytes, zbuffer.data,
+						  XLOG_BLCKSZ);
+		}
+		else
+		{
+			errno = 0;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+			{
+				int			save_errno = errno;
 
-			/*
-			 * If we fail to make the file, delete it to release disk space
-			 */
-			unlink(tmppath);
+				/*
+				 * If we fail to make the file, delete it to release disk
+				 * space
+				 */
+				unlink(tmppath);
 
-			close(fd);
+				close(fd);
 
-			/* if write didn't set errno, assume problem is no disk space */
-			errno = save_errno ? save_errno : ENOSPC;
+				/*
+				 * if write didn't set errno, assume problem is no disk space
+				 */
+				errno = save_errno ? save_errno : ENOSPC;
 
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write to file \"%s\": %m",
+								tmppath)));
+			}
 		}
 		pgstat_report_wait_end();
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, tmpaddr) != 0)
 	{
 		int			save_errno = errno;
 
-		close(fd);
+		do_XLogFileClose(fd, tmpaddr);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3303,7 +3353,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd))
+	if (do_XLogFileClose(fd, tmpaddr))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3330,7 +3380,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 								*use_existent, max_segno,
-								use_lock))
+								use_lock,
+								fsync_file))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3344,8 +3395,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+
+	if (fd < 0 && *addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3380,13 +3433,22 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	int			srcfd;
 	int			fd;
 	int			nbytes;
+	void	   *src_addr = NULL;
+	void	   *dst_addr = NULL;
+	bool		fsync_file = true;
 
 	/*
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno, wal_segment_size);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		srcfd = MapTransientFile(path, O_RDONLY | PG_BINARY,
+								 wal_segment_size, &src_addr);
+
+	if (src_addr == NULL)
+		srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+
+	if (srcfd < 0 && src_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3399,15 +3461,33 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	if (src_addr != NULL && sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fd = MapTransientFile(tmppath,
+							  O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  wal_segment_size, &dst_addr);
+	else
+		fd = OpenTransientFile(tmppath,
+							   O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	if (fd < 0 && dst_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m", tmppath)));
+				 errmsg("could not create file \"%s\": %m",
+						tmppath)));
 
 	/*
 	 * Do the data copying.
 	 */
+	if (src_addr && dst_addr)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+		PmemFileWrite(dst_addr, src_addr, wal_segment_size);
+		pgstat_report_wait_end();
+		fsync_file = false;
+
+		goto done_copy;
+	}
+
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += sizeof(buffer))
 	{
 		int			nread;
@@ -3459,29 +3539,42 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+					 errmsg("could not write to file \"%s\": %m",
+							tmppath)));
 		}
 		pgstat_report_wait_end();
 	}
 
+done_copy:
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, dst_addr) != 0)
 		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd))
+	if (dst_addr)
+	{
+		if (UnmapTransientFile(dst_addr, wal_segment_size))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not unmap file \"%s\": %m",
+							tmppath)));
+	}
+	else if (CloseTransientFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
-	CloseTransientFile(srcfd);
+	if (src_addr)
+		UnmapTransientFile(src_addr, wal_segment_size);
+	else
+		CloseTransientFile(srcfd);
 
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false, fsync_file))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3516,7 +3609,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool use_lock, bool fsync_file)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3555,7 +3648,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	if (durable_link_or_rename(tmppath, path, LOG) != 0)
+	if (durable_link_or_rename(tmppath, path, LOG, fsync_file) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3573,15 +3666,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(XLogSegNo segno)
+XLogFileOpen(XLogSegNo segno, void **addr)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3597,7 +3691,7 @@ XLogFileOpen(XLogSegNo segno)
  */
 static int
 XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk)
+			 int source, bool notfoundOk, void **addr)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
@@ -3646,8 +3740,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 	}
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-	if (fd >= 0)
+	fd = do_XLogFileOpen(path, O_RDONLY | PG_BINARY, addr);
+	if (fd >= 0 || *addr != NULL)
 	{
 		/* Success! */
 		curFileTLI = tli;
@@ -3679,7 +3773,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLEs.
  */
 static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source, void **addr)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -3719,8 +3813,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_ARCHIVE, true);
-			if (fd != -1)
+							  XLOG_FROM_ARCHIVE, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
@@ -3732,8 +3826,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_PG_WAL, true);
-			if (fd != -1)
+							  XLOG_FROM_PG_WAL, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3751,13 +3845,22 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 	return -1;
 }
 
+int
+do_XLogFileClose(int fd, void *addr)
+{
+	if (!addr)
+		return close(fd);
+
+	return PmemFileClose(addr, wal_segment_size);
+}
+
 /*
  * Close the current logfile segment for writing.
  */
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile >= 0 || mappedLogFileAddr != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3766,15 +3869,16 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && openLogFile >= 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						XLogFileNameP(ThisTimeLineID, openLogSegNo))));
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 }
 
@@ -3794,6 +3898,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
+	void	   *laddr = NULL;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3802,8 +3907,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		lf = XLogFileInit(_logSegNo, &use_existent, true, &laddr);
+		do_XLogFileClose(lf, laddr);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -4052,6 +4157,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	struct stat statbuf;
 	XLogSegNo	endlogSegNo;
 	XLogSegNo	recycleSegNo;
+	bool		fsync_file = true;
 
 	/*
 	 * Initialize info about where to try to recycle to.
@@ -4064,6 +4170,9 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 
 	snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname);
 
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fsync_file = false;
+
 	/*
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
@@ -4072,7 +4181,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-							   true, recycleSegNo, true))
+							   true, recycleSegNo, true, fsync_file))
 	{
 		ereport(DEBUG2,
 				(errmsg("recycled write-ahead log file \"%s\"",
@@ -4239,9 +4348,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			if (readFile >= 0)
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
 			{
-				close(readFile);
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+				mappedReadFileAddr = NULL;
 				readFile = -1;
 			}
 
@@ -4780,7 +4890,9 @@ UpdateControlFile(void)
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
 	if (write(fd, ControlFile, sizeof(ControlFileData)) != sizeof(ControlFileData))
 	{
-		/* if write didn't set errno, assume problem is no disk space */
+		/*
+		 * if write didn't set errno, assume problem is no disk space
+		 */
 		if (errno == 0)
 			errno = ENOSPC;
 		ereport(PANIC,
@@ -5212,34 +5324,44 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false, &mappedLogFileAddr);
 
 	/* Write the first page with the initial record */
 	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+
+	if (mappedLogFileAddr != NULL)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		PmemFileWrite(mappedLogFileAddr, page, XLOG_BLCKSZ);
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
 	}
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 
 	/* Now create pg_control */
@@ -5480,9 +5602,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * If the ending log segment is still open, close it (to avoid problems on
 	 * Windows with trying to rename or delete an open file).
 	 */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 
@@ -5521,10 +5644,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 */
 		bool		use_existent = true;
 		int			fd;
+		void	   *tmpaddr = NULL;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent, true, &tmpaddr);
 
-		if (close(fd))
+		if (do_XLogFileClose(fd, tmpaddr))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not close file \"%s\": %m",
@@ -7721,9 +7845,10 @@ StartupXLOG(void)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 	XLogReaderFree(xlogreader);
@@ -10062,6 +10187,9 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+#endif
 			return 0;
 #ifdef OPEN_SYNC_FLAG
 		case SYNC_METHOD_OPEN:
@@ -10079,7 +10207,36 @@ get_sync_bit(int method)
 }
 
 /*
- * GUC support
+ * GUC check_hook for xlog_sync_method
+ */
+bool
+check_xlog_sync_method(int *newval, void **extra, GucSource source)
+{
+	bool		ret;
+	char		tmppath[MAXPGPATH] = {0};
+	int			val = newval ? *newval : sync_method;
+
+	if (val != SYNC_METHOD_PMEM_DRAIN)
+		return true;
+
+	snprintf(tmppath, MAXPGPATH, "%s/" XLOGDIR "/pmem.tmp.%d", DataDir, (int) getpid());
+
+	ret = CheckPmem(tmppath);
+
+	if (!ret)
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for parameter \"wal_sync_method\": \"pmem_drain\"");
+		GUC_check_errdetail("%s isn't stored on persistent memory (pmem_is_pmem() returned false).",
+							XLOGDIR);
+		GUC_check_errhint("Please see also ENVIRONMENT VARIABLES section in man libpmem.");
+	}
+
+	return ret;
+}
+
+/*
+ * GUC assign_hook for xlog_sync_method
  */
 void
 assign_xlog_sync_method(int new_sync_method, void *extra)
@@ -10092,10 +10249,10 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m",
@@ -10144,6 +10301,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 						 errmsg("could not fdatasync file \"%s\": %m",
 								XLogFileNameP(ThisTimeLineID, segno))));
 			break;
+#endif
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+			PmemFileSync();
+			break;
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
@@ -10156,6 +10318,17 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	pgstat_report_wait_end();
 }
 
+int
+xlog_fsync(int fd, void *addr)
+{
+	if (!addr)
+		return pg_fsync(fd);
+
+	PmemFileSync();
+	return 0;
+}
+
+
 /*
  * Return the filename of given log segment, as a palloc'd string.
  */
@@ -11572,7 +11745,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || mappedReadFileAddr != NULL) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -11589,7 +11762,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 		readSource = 0;
 	}
@@ -11598,7 +11772,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readFile < 0 && mappedReadFileAddr == NULL) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
@@ -11607,8 +11781,9 @@ retry:
 										 private->fetching_ckpt,
 										 targetRecPtr))
 		{
-			if (readFile >= 0)
-				close(readFile);
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+			mappedReadFileAddr = NULL;
 			readFile = -1;
 			readLen = 0;
 			readSource = 0;
@@ -11621,7 +11796,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || mappedReadFileAddr != NULL);
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -11642,30 +11817,39 @@ retry:
 
 	/* Read the requested page */
 	readOff = targetPageOff;
-
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (mappedReadFileAddr)
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		PmemFileRead((char *) mappedReadFileAddr + readOff, readBuf,
+					 XLOG_BLCKSZ);
 
-		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
 	}
 	pgstat_report_wait_end();
 
@@ -11713,8 +11897,9 @@ retry:
 next_record_is_invalid:
 	lastSourceFailed = true;
 
-	if (readFile >= 0)
-		close(readFile);
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+	mappedReadFileAddr = NULL;
 	readFile = -1;
 	readLen = 0;
 	readSource = 0;
@@ -11971,9 +12156,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 				/* Close any old file we might have open. */
-				if (readFile >= 0)
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 				{
-					close(readFile);
+					do_XLogFileClose(readFile,
+									 mappedReadFileAddr);
+					mappedReadFileAddr = NULL;
 					readFile = -1;
 				}
 				/* Reset curFileTLI if random fetch. */
@@ -11986,8 +12173,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 */
 				readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
-											  currentSource);
-				if (readFile >= 0)
+											  currentSource, &mappedReadFileAddr);
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 					return true;	/* success! */
 
 				/*
@@ -12051,14 +12238,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (readFile < 0 && mappedReadFileAddr == NULL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
 													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+													XLOG_FROM_STREAM, false, &mappedReadFileAddr);
+							Assert(readFile >= 0 || mappedReadFileAddr != NULL);
 						}
 						else
 						{
diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index ca6a0e4f7d..9271153553 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/file
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o
+OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o pmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..8b51d7e8a0 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -88,6 +88,7 @@
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -223,6 +224,9 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
+#ifdef USE_LIBPMEM
+	AllocateDescMap,
+#endif
 	AllocateDescPipe,
 	AllocateDescDir,
 	AllocateDescRawFD
@@ -237,6 +241,10 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+#ifdef USE_LIBPMEM
+		size_t		fsize;
+		void	   *addr;
+#endif
 	}			desc;
 } AllocateDesc;
 
@@ -705,14 +713,16 @@ durable_unlink(const char *fname, int elevel)
  * valid upon return.
  */
 int
-durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
+durable_link_or_rename(const char *oldfile, const char *newfile, int elevel,
+					   bool fsync_file)
 {
 	/*
 	 * Ensure that, if we crash directly after the rename/link, a file with
 	 * valid contents is moved into place.
 	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+			return -1;
 
 #if HAVE_WORKING_LINK
 	if (link(oldfile, newfile) < 0)
@@ -740,8 +750,9 @@ durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
 	 * Make change persistent in case of an OS crash, both the new entry and
 	 * its parent directory need to be flushed.
 	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+			return -1;
 
 	/* Same for parent directory */
 	if (fsync_parent_path(newfile, elevel) != 0)
@@ -1556,6 +1567,78 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	return file;
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Mmap a file with MapTransientFilePerm() and pass default file mode for
+ * the fileMode parameter.
+ */
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return MapTransientFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT,
+								fsize, addr);
+}
+
+/*
+ * Like OpenTransientFilePerm, but maps the file and returns a pointer to the
+ * mapped area (as mmap(2) would) through *addr.
+ */
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	int			fd;
+
+	DO_DB(elog(LOG, "MapTransientFilePerm: Allocated %d (%s)",
+			   numAllocatedDescs, fileName));
+
+	/* Can we allocate another non-virtual FD? */
+	if (!reserveAllocatedDesc())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
+						maxAllocatedDescs, fileName)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	if (addr != NULL)
+	{
+		void	   *ret_addr = NULL;
+
+		fd = PmemFileOpenPerm(fileName, fileFlags, fileMode, fsize, &ret_addr);
+		if (ret_addr != NULL)
+		{
+			AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+
+			*addr = ret_addr;
+
+			desc->kind = AllocateDescMap;
+			desc->desc.addr = ret_addr;
+			desc->desc.fsize = fsize;
+			desc->create_subid = GetCurrentSubTransactionId();
+			numAllocatedDescs++;
+
+			return fd;
+		}
+	}
+
+	return -1;					/* failure */
+}
+#else
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return -1;
+}
+
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	return -1;
+}
+#endif
 
 /*
  * Create a new file.  The directory containing it must already exist.  Files
@@ -2361,6 +2444,11 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescRawFD:
 			result = close(desc->desc.fd);
 			break;
+#ifdef USE_LIBPMEM
+		case AllocateDescMap:
+			result = PmemFileClose(desc->desc.addr, desc->desc.fsize);
+			break;
+#endif
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -2402,6 +2490,42 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Unmap a file returned by MapTransientFile.
+ *
+ * Note we do not check unmap's return value --- it is up to the caller
+ * to handle unmap errors.
+ */
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "UnmapTransientFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove the mapping from the list of allocated descriptors, if present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescMap && desc->desc.addr == addr)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us a mapping not in allocatedDescs */
+	elog(WARNING, "address passed to UnmapTransientFile was not obtained from MapTransientFile");
+
+	return PmemFileClose(addr, fsize);
+}
+#else
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	return -1;
+}
+#endif
+
 /*
  * Close a file returned by OpenTransientFile.
  *
diff --git a/src/backend/storage/file/pmem.c b/src/backend/storage/file/pmem.c
new file mode 100644
index 0000000000..b214b6b18e
--- /dev/null
+++ b/src/backend/storage/file/pmem.c
@@ -0,0 +1,188 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.c
+ *	  Memory-mapped file access for persistent memory (PMDK).
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/file/pmem.c
+ *
+ * NOTES:
+ *
+ * This code manages a memory-mapped file on a DAX-mounted filesystem on a
+ * persistent memory device, using the Persistent Memory Development Kit
+ * (http://pmem.io/pmdk/).
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/pmem.h"
+#include "storage/fd.h"
+
+#ifdef USE_LIBPMEM
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libpmem.h>
+#include <sys/mman.h>
+#include <string.h>
+
+#define PmemFileSize 32
+
+/*
+ * This function returns true only if the file is stored on persistent memory.
+ */
+bool
+CheckPmem(const char *path)
+{
+	int			is_pmem = 0;	/* false */
+	size_t		mapped_len = 0;
+	bool		ret = true;
+	void	   *tmpaddr;
+
+	/*
+	 * is_pmem stays 0 if the file (path) isn't stored on persistent
+	 * memory.
+	 */
+	tmpaddr = pmem_map_file(path, PmemFileSize, PMEM_FILE_CREATE,
+							PG_FILE_MODE_DEFAULT, &mapped_len, &is_pmem);
+
+	if (tmpaddr)
+	{
+		pmem_unmap(tmpaddr, mapped_len);
+		unlink(path);
+	}
+
+	if (is_pmem)
+		elog(LOG, "%s is stored on persistent memory.", path);
+	else
+		ret = false;
+
+	return ret;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return PmemFileOpenPerm(pathname, flags, PG_FILE_MODE_DEFAULT, fsize, addr);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	int			mapped_flag = 0;
+	size_t		mapped_len = 0;
+	size_t		size = 0;
+	void	   *ret_addr;
+
+	if (addr == NULL)
+		return BasicOpenFile(pathname, flags);
+
+	/* non-zero 'len' not allowed without PMEM_FILE_CREATE */
+	if (flags & O_CREAT)
+	{
+		mapped_flag = PMEM_FILE_CREATE;
+		size = fsize;
+	}
+
+	if (flags & O_EXCL)
+		mapped_flag |= PMEM_FILE_EXCL;
+
+	ret_addr = pmem_map_file(pathname, size, mapped_flag, mode, &mapped_len,
+							 NULL);
+
+	if (fsize != mapped_len)
+	{
+		if (ret_addr != NULL)
+			pmem_unmap(ret_addr, mapped_len);
+
+		return -1;
+	}
+
+	if (mapped_flag & PMEM_FILE_CREATE)
+		if (msync(ret_addr, mapped_len, MS_SYNC))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not msync log file %s: %m", pathname)));
+
+	*addr = ret_addr;
+
+	return NO_FD_FOR_MAPPED_FILE;
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	pmem_memcpy_nodrain((void *) dest, src, len);
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	memcpy(buf, (void *) map_addr, len);
+}
+
+void
+PmemFileSync(void)
+{
+	pmem_drain();
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	return pmem_unmap((void *) addr, fsize);
+}
+
+
+#else
+bool
+CheckPmem(const char *path)
+{
+	return true;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return BasicOpenFile(pathname, flags);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	return BasicOpenFilePerm(pathname, flags, mode);
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileSync(void)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+	return -1;
+}
+#endif
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c216ed0922..e9959023da 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4333,7 +4333,7 @@ static struct config_enum ConfigureNamesEnum[] =
 		},
 		&sync_method,
 		DEFAULT_SYNC_METHOD, sync_method_options,
-		NULL, assign_xlog_sync_method, NULL
+		check_xlog_sync_method, assign_xlog_sync_method, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..8cd915eb94 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -196,6 +196,7 @@
 					#   fsync
 					#   fsync_writethrough
 					#   open_sync
+					#   pmem_drain
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a9139..033876c3c6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -27,6 +27,7 @@
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
+#define SYNC_METHOD_PMEM_DRAIN	5	/* for Persistent Memory Development Kit */
 extern int	sync_method;
 
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
@@ -259,8 +260,10 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern int XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+			 void **addr);
+extern int	XLogFileOpen(XLogSegNo segno, void **addr);
+extern int	do_XLogFileClose(int fd, void *addr);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -272,6 +275,7 @@ extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern int	xlog_fsync(int fd, void *addr);
 
 extern bool RecoveryInProgress(void);
 extern bool HotStandbyActive(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb..834e3f7353 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,13 @@
 typedef int File;
 
 
+/*
+ * Default mode for created files, unless something else is specified using
+ * the *Perm() function variants.
+ */
+#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
@@ -104,6 +111,13 @@ extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern int	CloseTransientFile(int fd);
 
+/* Operations to allow use of a memory-mapped file */
+extern int MapTransientFile(const char *fileName, int fileFlags, size_t fsize,
+				 void **addr);
+extern int MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr);
+extern int	UnmapTransientFile(void *addr, size_t fsize);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
 extern int	BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -133,7 +147,8 @@ extern void pg_flush_data(int fd, off_t offset, off_t amount);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int durable_link_or_rename(const char *oldfile, const char *newfile,
+					   int loglevel, bool fsync_file);
 extern void SyncDataDirectory(void);
 extern int data_sync_elevel(int elevel);
 
diff --git a/src/include/storage/pmem.h b/src/include/storage/pmem.h
new file mode 100644
index 0000000000..b9b9156c91
--- /dev/null
+++ b/src/include/storage/pmem.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.h
+ *		Definitions for memory-mapped file access on persistent memory.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/pmem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PMEM_H
+#define PMEM_H
+
+#include "postgres.h"
+
+#define NO_FD_FOR_MAPPED_FILE -2
+
+extern bool CheckPmem(const char *path);
+extern int PmemFileOpen(const char *pathname, int flags, size_t fsize,
+			 void **addr);
+extern int PmemFileOpenPerm(const char *pathname, int flags, int mode,
+				 size_t fsize, void **addr);
+extern void PmemFileWrite(void *dest, void *src, size_t len);
+extern void PmemFileRead(void *map_addr, void *buf, size_t len);
+extern void PmemFileSync(void);
+extern int	PmemFileClose(void *addr, size_t fsize);
+
+#endif							/* PMEM_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index c07e7b945e..436ab961fc 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -433,6 +433,7 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xlog_sync_method(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
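
The patch above threads an (fd, addr) pair through the WAL code: a segment is either a plain descriptor with addr == NULL, or a mapping with fd set to NO_FD_FOR_MAPPED_FILE, and helpers such as do_XLogFileClose() and xlog_fsync() dispatch on addr. The following is only a minimal sketch of that convention, not code from the patch. It assumes plain POSIX mmap()/msync()/munmap() as stand-ins for pmem_map_file()/pmem_drain()/pmem_unmap(), so it builds without libpmem, and the names seg_open, seg_write, seg_sync and seg_close are hypothetical.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define NO_FD_FOR_MAPPED_FILE (-2)	/* same sentinel value as in pmem.h */

/*
 * Hypothetical helper: open (creating if needed) a file of "size" bytes.
 * With use_map set, the file is mapped, *addr receives the mapping and the
 * sentinel is returned; otherwise a plain descriptor is returned and *addr
 * stays NULL.  Returns -1 on failure in both modes.
 */
static int
seg_open(const char *path, size_t size, int use_map, void **addr)
{
	int			fd;
	void	   *p;

	*addr = NULL;
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) size) != 0)
	{
		close(fd);
		return -1;
	}
	if (!use_map)
		return fd;

	/* mmap() stands in for pmem_map_file() in this sketch */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping remains valid after close */
	if (p == MAP_FAILED)
		return -1;
	*addr = p;
	return NO_FD_FOR_MAPPED_FILE;
}

/* Write through the mapping when there is one, else through the descriptor. */
static int
seg_write(int fd, void *addr, size_t off, const void *buf, size_t len)
{
	assert(buf != NULL);
	if (addr != NULL)
	{
		memcpy((char *) addr + off, buf, len);
		return 0;
	}
	return pwrite(fd, buf, len, (off_t) off) == (ssize_t) len ? 0 : -1;
}

/* Same dispatch as xlog_fsync(): msync() stands in for pmem_drain(). */
static int
seg_sync(int fd, void *addr, size_t size)
{
	if (addr != NULL)
		return msync(addr, size, MS_SYNC);
	return fsync(fd);
}

/* Same dispatch as do_XLogFileClose(): unmap a mapping, close a descriptor. */
static int
seg_close(int fd, void *addr, size_t size)
{
	if (addr != NULL)
		return munmap(addr, size);
	return close(fd);
}
```

Callers test "fd >= 0 || addr != NULL" for success, as the patch does, because a mapped segment never has a usable descriptor.
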
0003-Walreceiver-WAL-IO-using-PMDK-v2.patch (application/octet-stream)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..044fe078a8 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -61,6 +61,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -91,6 +92,7 @@ static int	recvFile = -1;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
 static uint32 recvOff = 0;
+void	   *mappedFileAddr = NULL;
 
 /*
  * Flags set by interrupt handlers of walreceiver for later service in the
@@ -599,12 +601,12 @@ WalReceiverMain(void)
 		 * End of WAL reached on the requested timeline. Close the last
 		 * segment, and await for new orders from the startup process.
 		 */
-		if (recvFile >= 0)
+		if (recvFile >= 0 || mappedFileAddr != NULL)
 		{
 			char		xlogfname[MAXFNAMELEN];
 
 			XLogWalRcvFlush(false);
-			if (close(recvFile) != 0)
+			if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -621,6 +623,7 @@ WalReceiverMain(void)
 				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
+		mappedFileAddr = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -931,7 +934,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if ((recvFile < 0 && mappedFileAddr == NULL) ||
+			!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
 			bool		use_existent;
 
@@ -939,7 +943,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
 			 */
-			if (recvFile >= 0)
+			if (recvFile >= 0 || mappedFileAddr != NULL)
 			{
 				char		xlogfname[MAXFNAMELEN];
 
@@ -950,7 +954,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -967,11 +971,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
+			mappedFileAddr = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true, &mappedFileAddr);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
 		}
@@ -987,30 +992,39 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* Need to seek in the file? */
 		if (recvOff != startoff)
 		{
-			if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
-				ereport(PANIC,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(recvFileTLI, recvSegNo),
-								startoff)));
+			if (!mappedFileAddr)
+				if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
+					ereport(PANIC,
+							(errcode_for_file_access(),
+							 errmsg("could not seek in log segment %s to offset %u: %m",
+									XLogFileNameP(recvFileTLI, recvSegNo),
+									startoff)));
 			recvOff = startoff;
 		}
 
-		/* OK to write the logs */
-		errno = 0;
+		if (mappedFileAddr)
+		{
+			PmemFileWrite((char *) mappedFileAddr + startoff, buf, segbytes);
+			byteswritten = segbytes;
+		}
+		else
+		{
+			/* OK to write the logs */
+			errno = 0;
 
-		byteswritten = write(recvFile, buf, segbytes);
-		if (byteswritten <= 0)
-		{
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							XLogFileNameP(recvFileTLI, recvSegNo),
-							recvOff, (unsigned long) segbytes)));
+			byteswritten = write(recvFile, buf, segbytes);
+			if (byteswritten <= 0)
+			{
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+								"at offset %u, length %lu: %m",
+								XLogFileNameP(recvFileTLI, recvSegNo),
+								recvOff, (unsigned long) segbytes)));
+			}
 		}
 
 		/* Update state for write */
#33Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Takashi Menjo (#32)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On 25/01/2019 09:52, Takashi Menjo wrote:

Heikki Linnakangas wrote:

To re-iterate what I said earlier in this thread, I think the next step
here is to write a patch that modifies xlog.c to use plain old
mmap()/msync() to memory-map the WAL files, to replace the WAL buffers.

Sorry, but my new patchset still uses PMDK, because PMDK is supported on
Linux _and Windows_, and I think someone may want to test this patchset on
Windows...
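
As a rough illustration of the plain mmap()/msync() approach mentioned above
(this is not code from any patch in this thread), the sketch below maps a
preallocated file, stores bytes at an offset, and flushes the affected pages.
The helper names map_segment and mapped_write_and_flush are hypothetical.
msync() needs a page-aligned start address, so the flushed range is rounded
down to a page boundary.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create (or truncate) a file of seg_size bytes and
 * map it read/write. Returns the mapping, or NULL on failure.
 */
void *
map_segment(const char *path, size_t seg_size)
{
	int			fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	void	   *base;

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t) seg_size) != 0)
	{
		close(fd);
		return NULL;
	}
	base = mmap(NULL, seg_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping stays valid after close() */
	return (base == MAP_FAILED) ? NULL : base;
}

/*
 * Hypothetical helper: copy len bytes into the mapping at offset off, then
 * msync() the pages that cover [off, off + len). Returns 0 on success.
 */
int
mapped_write_and_flush(void *base, size_t off, const void *src, size_t len)
{
	size_t		pagesz = (size_t) sysconf(_SC_PAGESIZE);
	size_t		start = off - (off % pagesz);	/* round down to a page */

	assert(base != NULL);
	memcpy((char *) base + off, src, len);
	return msync((char *) base + start, (off - start) + len, MS_SYNC);
}
```

A caller would map a segment once, call mapped_write_and_flush() for each
write, and munmap() the region when switching segments. On a DAX filesystem
the same calls work, although PMDK can avoid the msync() system call there.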

When you manage the WAL (or perhaps in the future relation files)
through PMDK, is there still a file system view of it somewhere, for
browsing, debugging, and for monitoring tools?

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#34Takashi Menjo
menjo.takashi@lab.ntt.co.jp
In reply to: Peter Eisentraut (#33)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Hi,

Peter Eisentraut wrote:

When you manage the WAL (or perhaps in the future relation files)
through PMDK, is there still a file system view of it somewhere, for
browsing, debugging, and for monitoring tools?

First, I assume that our patchset is used with a filesystem that supports
the direct access (DAX) feature; I have tested it with ext4 on Linux. You can
cd into the pg_wal directory created by initdb -X pg_wal on such a
filesystem, and ls the WAL segment files managed by PMDK at runtime.

As for the PostgreSQL-specific tools, they will probably work, but I have not
tested them yet. At least pg_waldump appears to work as before.
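
To illustrate why a memory-mapped file stays visible to ordinary tools, here
is a minimal sketch using plain POSIX mmap() rather than PMDK (an assumption
made to keep the example self-contained). The helper names store_via_mapping
and load_via_read are hypothetical. One function stores data through a
mapping; the other reads it back with open()/read(), as a file-based tool
would.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a file of the given size, map it, copy msg
 * (including its terminating NUL) to the start, flush, and unmap.
 * Returns 0 on success, -1 on failure.
 */
int
store_via_mapping(const char *path, size_t size, const char *msg)
{
	int			fd;
	char	   *base;

	assert(strlen(msg) < size);
	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) size) != 0)
	{
		close(fd);
		return -1;
	}
	base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (base == MAP_FAILED)
		return -1;

	strcpy(base, msg);			/* a store into the file, no write() call */
	if (msync(base, size, MS_SYNC) != 0)
	{
		munmap(base, size);
		return -1;
	}
	return munmap(base, size);
}

/*
 * Hypothetical helper: read up to buflen bytes from the start of the file
 * with ordinary system calls. Returns the byte count, or -1 on failure.
 */
ssize_t
load_via_read(const char *path, char *buf, size_t buflen)
{
	int			fd = open(path, O_RDONLY);
	ssize_t		n;

	if (fd < 0)
		return -1;
	n = read(fd, buf, buflen);
	close(fd);
	return n;
}
```

The file keeps its name, size, and contents in the directory tree, so stat(),
ls, and readers based on read() see the data that was stored through the
mapping.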

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>

#35Takashi Menjo
menjo.takashi@lab.ntt.co.jp
In reply to: Takashi Menjo (#34)
3 attachment(s)
RE: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Hi,

Sorry, but I found that patchset v2 had a bug in managing the WAL segment
file offset. I have fixed it and attached the updated patchset as v3.

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>

Attachments:

0001-Add-configure-option-for-PMDK-v3.patch (application/octet-stream)
diff --git a/configure b/configure
index ddb3c8b1ba..a23d13d602 100755
--- a/configure
+++ b/configure
@@ -702,6 +702,7 @@ EGREP
 GREP
 with_zlib
 with_system_tzdata
+with_libpmem
 with_libxslt
 with_libxml
 XML2_CONFIG
@@ -861,6 +862,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libpmem
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1566,6 +1568,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libpmem          use PMEM support for WAL I/O
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -8241,6 +8244,33 @@ fi
 
 
 
+#
+# PMEM
+#
+
+
+
+# Check whether --with-libpmem was given.
+if test "${with_libpmem+set}" = set; then :
+  withval=$with_libpmem;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBPMEM 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libpmem=no
+
+fi
 
 
 
@@ -12322,6 +12352,57 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'pmem' is required for PMEM support" "$LINENO" 5
+fi
+
+fi
+
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -13148,6 +13229,17 @@ else
 fi
 
 
+fi
+
+if test "$with_libpmem" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for PMEM support" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$with_ldap" = yes ; then
diff --git a/configure.in b/configure.in
index 3d8888805c..91ef21a1cc 100644
--- a/configure.in
+++ b/configure.in
@@ -943,6 +943,14 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 
 AC_SUBST(with_libxslt)
 
+#
+# PMEM
+#
+PGAC_ARG_BOOL(with, libpmem, no, [use PMEM support for WAL I/O],
+	      [AC_DEFINE([USE_LIBPMEM], 1, [Define to 1 to use PMEM support for WAL I/O. (--with-libpmem)])])
+
+AC_SUBST(with_libpmem)
+
 #
 # tzdata
 #
@@ -1224,6 +1232,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [], [AC_MSG_ERROR([library 'pmem' is required for PMEM support])])
+fi
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -1412,6 +1424,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for PMEM support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index 82547f321f..fde27aaa70 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -374,6 +374,9 @@
 /* Define to 1 if you have the `xslt' library (-lxslt). */
 #undef HAVE_LIBXSLT
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
@@ -917,6 +920,9 @@
 /* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
 #undef USE_LLVM
 
+/* Define to 1 to use PMEM support for WAL I/O. (--with-libpmem) */
+#undef USE_LIBPMEM
+
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
0002-Read-write-WAL-files-using-PMDK-v3.patch (application/octet-stream)
diff --git a/src/backend/access/transam/timeline.c b/src/backend/access/transam/timeline.c
index c96c8b60ba..eef94100bc 100644
--- a/src/backend/access/transam/timeline.c
+++ b/src/backend/access/transam/timeline.c
@@ -426,7 +426,7 @@ writeTimeLineHistory(TimeLineID newTLI, TimeLineID parentTLI,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 
 	/* The history file can be archived immediately. */
 	if (XLogArchivingActive())
@@ -505,7 +505,7 @@ writeTimeLineHistoryFile(TimeLineID tli, char *content, int size)
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	durable_link_or_rename(tmppath, path, ERROR);
+	durable_link_or_rename(tmppath, path, ERROR, true);
 }
 
 /*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 2ab7d804f0..1c975fa5e5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -58,6 +58,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -140,6 +141,9 @@ const struct config_enum_entry sync_method_options[] = {
 #endif
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
+#endif
+#ifdef USE_LIBPMEM
+	{"pmem_drain", SYNC_METHOD_PMEM_DRAIN, false},
 #endif
 	{NULL, 0, false}
 };
@@ -778,6 +782,7 @@ static const char *xlogSourceNames[] = {"any", "archive", "pg_wal", "stream"};
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
 static uint32 openLogOff = 0;
+static void *mappedLogFileAddr = NULL;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -792,6 +797,7 @@ static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static XLogSource readSource = 0;	/* XLOG_FROM_* code */
+static void *mappedReadFileAddr = NULL;
 
 /*
  * Keeps track of which source we're currently reading from. This is
@@ -877,13 +883,15 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
+static int	do_XLogFileOpen(char *pathname, int flags, void **addr);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock);
+					   bool use_lock, bool fsync_file);
 static int XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk);
-static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source);
+			 int source, bool notfoundOk, void **addr);
+static int XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source,
+				   void **addr);
 static int XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 			 int reqLen, XLogRecPtr targetRecPtr, char *readBuf,
 			 TimeLineID *readTLI);
@@ -2360,6 +2368,15 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }
 
+static int
+do_XLogFileOpen(char *pathname, int flags, void **addr)
+{
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		return PmemFileOpen(pathname, flags, wal_segment_size, addr);
+	else
+		return BasicOpenFile(pathname, flags);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2439,23 +2456,25 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent,
+									   true, &mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile < 0 && mappedLogFileAddr == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
+			openLogFile = XLogFileOpen(openLogSegNo,
+									   &mappedLogFileAddr);
 			openLogOff = 0;
 		}
 
@@ -2492,28 +2511,43 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
+
+			if (mappedLogFileAddr != NULL)
 			{
-				errno = 0;
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				PmemFileWrite((char *) mappedLogFileAddr + startoffset, from, nbytes);
 				pgstat_report_wait_end();
-				if (written <= 0)
+
+				written = nbytes;
+				nleft = 0;
+				from += nbytes;
+				startoffset += nbytes;
+			}
+			else
+			{
+				nleft = nbytes;
+				do
 				{
-					if (errno == EINTR)
-						continue;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									XLogFileNameP(ThisTimeLineID, openLogSegNo),
-									openLogOff, nbytes)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+					errno = 0;
+					pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+					written = pg_pwrite(openLogFile, from, nleft, startoffset);
+					pgstat_report_wait_end();
+					if (written <= 0)
+					{
+						if (errno == EINTR)
+							continue;
+						ereport(PANIC,
+								(errcode_for_file_access(),
+								 errmsg("could not write to log file %s "
+										"at offset %u, length %zu: %m",
+										XLogFileNameP(ThisTimeLineID, openLogSegNo),
+										openLogOff, nbytes)));
+					}
+					nleft -= written;
+					from += written;
+					startoffset += written;
+				} while (nleft > 0);
+			}
 
 			/* Update state for write */
 			openLogOff += nbytes;
@@ -2593,15 +2627,16 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if ((openLogFile >= 0 || mappedLogFileAddr != NULL) &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile < 0 && mappedLogFileAddr == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
+				openLogFile = XLogFileOpen(openLogSegNo,
+										   &mappedLogFileAddr);
 				openLogOff = 0;
 			}
 
@@ -3026,7 +3061,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3207,7 +3242,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+			 void **addr)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3216,6 +3252,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 	int			fd;
 	int			nbytes;
+	void	   *tmpaddr = NULL;
+	bool		fsync_file = true;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3224,8 +3262,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		fd = do_XLogFileOpen(path,
+							 O_RDWR | PG_BINARY | get_sync_bit(sync_method),
+							 &tmpaddr);
+		if (fd < 0 && tmpaddr == NULL)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
@@ -3233,7 +3273,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
 		else
+		{
+			*addr = tmpaddr;
 			return fd;
+		}
 	}
 
 	/*
@@ -3249,8 +3292,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	fd = do_XLogFileOpen(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						 &tmpaddr);
+	if (fd < 0 && tmpaddr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3267,35 +3311,49 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	memset(zbuffer.data, 0, XLOG_BLCKSZ);
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
 	{
-		errno = 0;
-		pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
-		if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+		if (tmpaddr != NULL)
 		{
-			int			save_errno = errno;
+			fsync_file = false;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			PmemFileWrite((char *) tmpaddr + nbytes, zbuffer.data,
+						  XLOG_BLCKSZ);
+		}
+		else
+		{
+			errno = 0;
+			pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_WRITE);
+			if ((int) write(fd, zbuffer.data, XLOG_BLCKSZ) != (int) XLOG_BLCKSZ)
+			{
+				int			save_errno = errno;
 
-			/*
-			 * If we fail to make the file, delete it to release disk space
-			 */
-			unlink(tmppath);
+				/*
+				 * If we fail to make the file, delete it to release disk
+				 * space
+				 */
+				unlink(tmppath);
 
-			close(fd);
+				close(fd);
 
-			/* if write didn't set errno, assume problem is no disk space */
-			errno = save_errno ? save_errno : ENOSPC;
+				/*
+				 * if write didn't set errno, assume problem is no disk space
+				 */
+				errno = save_errno ? save_errno : ENOSPC;
 
-			ereport(ERROR,
-					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not write to file \"%s\": %m",
+								tmppath)));
+			}
 		}
 		pgstat_report_wait_end();
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, tmpaddr) != 0)
 	{
 		int			save_errno = errno;
 
-		close(fd);
+		do_XLogFileClose(fd, tmpaddr);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3303,7 +3361,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd))
+	if (do_XLogFileClose(fd, tmpaddr))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3330,7 +3388,8 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	max_segno = logsegno + CheckPointSegments;
 	if (!InstallXLogFileSegment(&installed_segno, tmppath,
 								*use_existent, max_segno,
-								use_lock))
+								use_lock,
+								fsync_file))
 	{
 		/*
 		 * No need for any more future segments, or InstallXLogFileSegment()
@@ -3344,8 +3403,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+
+	if (fd < 0 && *addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3380,13 +3441,22 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	int			srcfd;
 	int			fd;
 	int			nbytes;
+	void	   *src_addr = NULL;
+	void	   *dst_addr = NULL;
+	bool		fsync_file = true;
 
 	/*
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno, wal_segment_size);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		srcfd = MapTransientFile(path, O_RDONLY | PG_BINARY,
+								 wal_segment_size, &src_addr);
+
+	if (src_addr == NULL)
+		srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+
+	if (srcfd < 0 && src_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3399,15 +3469,33 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	if (src_addr != NULL && sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fd = MapTransientFile(tmppath,
+							  O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  wal_segment_size, &dst_addr);
+	else
+		fd = OpenTransientFile(tmppath,
+							   O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	if (fd < 0 && dst_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
-				 errmsg("could not create file \"%s\": %m", tmppath)));
+				 errmsg("could not create file \"%s\": %m",
+						tmppath)));
 
 	/*
 	 * Do the data copying.
 	 */
+	if (src_addr && dst_addr)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+		PmemFileWrite(dst_addr, src_addr, wal_segment_size);
+		pgstat_report_wait_end();
+		fsync_file = false;
+
+		goto done_copy;
+	}
+
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += sizeof(buffer))
 	{
 		int			nread;
@@ -3459,29 +3547,42 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 
 			ereport(ERROR,
 					(errcode_for_file_access(),
-					 errmsg("could not write to file \"%s\": %m", tmppath)));
+					 errmsg("could not write to file \"%s\": %m",
+							tmppath)));
 		}
 		pgstat_report_wait_end();
 	}
 
+done_copy:
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, dst_addr) != 0)
 		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd))
+	if (dst_addr)
+	{
+		if (UnmapTransientFile(dst_addr, wal_segment_size))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not unmap file \"%s\": %m",
+							tmppath)));
+	}
+	else if (CloseTransientFile(fd))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
 
-	CloseTransientFile(srcfd);
+	if (src_addr)
+		UnmapTransientFile(src_addr, wal_segment_size);
+	else
+		CloseTransientFile(srcfd);
 
 	/*
 	 * Now move the segment into place with its final name.
 	 */
-	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false))
+	if (!InstallXLogFileSegment(&destsegno, tmppath, false, 0, false, fsync_file))
 		elog(ERROR, "InstallXLogFileSegment should not have failed");
 }
 
@@ -3516,7 +3617,7 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 static bool
 InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 					   bool find_free, XLogSegNo max_segno,
-					   bool use_lock)
+					   bool use_lock, bool fsync_file)
 {
 	char		path[MAXPGPATH];
 	struct stat stat_buf;
@@ -3555,7 +3656,7 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 	 * Perform the rename using link if available, paranoidly trying to avoid
 	 * overwriting an existing file (there shouldn't be one).
 	 */
-	if (durable_link_or_rename(tmppath, path, LOG) != 0)
+	if (durable_link_or_rename(tmppath, path, LOG, fsync_file) != 0)
 	{
 		if (use_lock)
 			LWLockRelease(ControlFileLock);
@@ -3573,15 +3674,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(XLogSegNo segno)
+XLogFileOpen(XLogSegNo segno, void **addr)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3597,7 +3699,7 @@ XLogFileOpen(XLogSegNo segno)
  */
 static int
 XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 int source, bool notfoundOk)
+			 int source, bool notfoundOk, void **addr)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
@@ -3646,8 +3748,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 	}
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-	if (fd >= 0)
+	fd = do_XLogFileOpen(path, O_RDONLY | PG_BINARY, addr);
+	if (fd >= 0 || *addr != NULL)
 	{
 		/* Success! */
 		curFileTLI = tli;
@@ -3679,7 +3781,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLEs.
  */
 static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source, void **addr)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -3719,8 +3821,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_ARCHIVE, true);
-			if (fd != -1)
+							  XLOG_FROM_ARCHIVE, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
@@ -3732,8 +3834,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_PG_WAL, true);
-			if (fd != -1)
+							  XLOG_FROM_PG_WAL, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3751,13 +3853,22 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, int source)
 	return -1;
 }
 
+int
+do_XLogFileClose(int fd, void *addr)
+{
+	if (!addr)
+		return close(fd);
+
+	return PmemFileClose(addr, wal_segment_size);
+}
+
 /*
  * Close the current logfile segment for writing.
  */
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile >= 0 || mappedLogFileAddr != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3766,15 +3877,16 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && openLogFile >= 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m",
 						XLogFileNameP(ThisTimeLineID, openLogSegNo))));
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 }
 
@@ -3794,6 +3906,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
+	void	   *laddr = NULL;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3802,8 +3915,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		lf = XLogFileInit(_logSegNo, &use_existent, true, &laddr);
+		do_XLogFileClose(lf, laddr);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -4052,6 +4165,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	struct stat statbuf;
 	XLogSegNo	endlogSegNo;
 	XLogSegNo	recycleSegNo;
+	bool		fsync_file = true;
 
 	/*
 	 * Initialize info about where to try to recycle to.
@@ -4064,6 +4178,9 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 
 	snprintf(path, MAXPGPATH, XLOGDIR "/%s", segname);
 
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fsync_file = false;
+
 	/*
 	 * Before deleting the file, see if it can be recycled as a future log
 	 * segment. Only recycle normal files, pg_standby for example can create
@@ -4072,7 +4189,7 @@ RemoveXlogFile(const char *segname, XLogRecPtr RedoRecPtr, XLogRecPtr endptr)
 	if (endlogSegNo <= recycleSegNo &&
 		lstat(path, &statbuf) == 0 && S_ISREG(statbuf.st_mode) &&
 		InstallXLogFileSegment(&endlogSegNo, path,
-							   true, recycleSegNo, true))
+							   true, recycleSegNo, true, fsync_file))
 	{
 		ereport(DEBUG2,
 				(errmsg("recycled write-ahead log file \"%s\"",
@@ -4239,9 +4356,10 @@ ReadRecord(XLogReaderState *xlogreader, XLogRecPtr RecPtr, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			if (readFile >= 0)
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
 			{
-				close(readFile);
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+				mappedReadFileAddr = NULL;
 				readFile = -1;
 			}
 
@@ -4780,7 +4898,9 @@ UpdateControlFile(void)
 	pgstat_report_wait_start(WAIT_EVENT_CONTROL_FILE_WRITE_UPDATE);
 	if (write(fd, ControlFile, sizeof(ControlFileData)) != sizeof(ControlFileData))
 	{
-		/* if write didn't set errno, assume problem is no disk space */
+		/*
+		 * if write didn't set errno, assume problem is no disk space
+		 */
 		if (errno == 0)
 			errno = ENOSPC;
 		ereport(PANIC,
@@ -5212,34 +5332,44 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false, &mappedLogFileAddr);
 
 	/* Write the first page with the initial record */
 	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+
+	if (mappedLogFileAddr != NULL)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		PmemFileWrite(mappedLogFileAddr, page, XLOG_BLCKSZ);
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
 	}
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile))
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 
 	/* Now create pg_control */
@@ -5480,9 +5610,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * If the ending log segment is still open, close it (to avoid problems on
 	 * Windows with trying to rename or delete an open file).
 	 */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 
@@ -5521,10 +5652,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 */
 		bool		use_existent = true;
 		int			fd;
+		void	   *tmpaddr = NULL;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent, true, &tmpaddr);
 
-		if (close(fd))
+		if (do_XLogFileClose(fd, tmpaddr))
 			ereport(ERROR,
 					(errcode_for_file_access(),
 					 errmsg("could not close file \"%s\": %m",
@@ -7721,9 +7853,10 @@ StartupXLOG(void)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 	XLogReaderFree(xlogreader);
@@ -10062,6 +10195,9 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+#endif
 			return 0;
 #ifdef OPEN_SYNC_FLAG
 		case SYNC_METHOD_OPEN:
@@ -10079,7 +10215,36 @@ get_sync_bit(int method)
 }
 
 /*
- * GUC support
+ * GUC check_hook for xlog_sync_method
+ */
+bool
+check_xlog_sync_method(int *newval, void **extra, GucSource source)
+{
+	bool		ret;
+	char		tmppath[MAXPGPATH] = {0};
+	int			val = newval ? *newval : sync_method;
+
+	if (val != SYNC_METHOD_PMEM_DRAIN)
+		return true;
+
+	snprintf(tmppath, MAXPGPATH, "%s/" XLOGDIR "/pmem.tmp.%d", DataDir, (int) getpid());
+
+	ret = CheckPmem(tmppath);
+
+	if (!ret)
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for parameter \"wal_sync_method\": \"pmem_drain\"");
+		GUC_check_errdetail("%s is not stored on persistent memory (pmem_is_pmem() returned false).",
+							XLOGDIR);
+		GUC_check_errhint("Please see also ENVIRONMENT VARIABLES section in man libpmem.");
+	}
+
+	return ret;
+}
+
+/*
+ * GUC assign_hook for xlog_sync_method
  */
 void
 assign_xlog_sync_method(int new_sync_method, void *extra)
@@ -10092,10 +10257,10 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not fsync file \"%s\": %m",
@@ -10144,6 +10309,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 						 errmsg("could not fdatasync file \"%s\": %m",
 								XLogFileNameP(ThisTimeLineID, segno))));
 			break;
+#endif
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+			PmemFileSync();
+			break;
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
@@ -10156,6 +10326,17 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	pgstat_report_wait_end();
 }
 
+int
+xlog_fsync(int fd, void *addr)
+{
+	if (!addr)
+		return pg_fsync(fd);
+
+	PmemFileSync();
+	return 0;
+}
+
+
 /*
  * Return the filename of given log segment, as a palloc'd string.
  */
@@ -11572,7 +11753,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || mappedReadFileAddr != NULL) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -11589,7 +11770,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 		readSource = 0;
 	}
@@ -11598,7 +11780,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readFile < 0 && mappedReadFileAddr == NULL) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 receivedUpto < targetPagePtr + reqLen))
 	{
@@ -11607,8 +11789,9 @@ retry:
 										 private->fetching_ckpt,
 										 targetRecPtr))
 		{
-			if (readFile >= 0)
-				close(readFile);
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+			mappedReadFileAddr = NULL;
 			readFile = -1;
 			readLen = 0;
 			readSource = 0;
@@ -11621,7 +11804,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || mappedReadFileAddr != NULL);
 
 	/*
 	 * If the current segment is being streamed from master, calculate how
@@ -11642,30 +11825,39 @@ retry:
 
 	/* Read the requested page */
 	readOff = targetPageOff;
-
-	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (mappedReadFileAddr)
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		PmemFileRead((char *) mappedReadFileAddr + readOff, readBuf,
+					 XLOG_BLCKSZ);
 
-		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
 	}
 	pgstat_report_wait_end();
 
@@ -11713,8 +11905,9 @@ retry:
 next_record_is_invalid:
 	lastSourceFailed = true;
 
-	if (readFile >= 0)
-		close(readFile);
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+	mappedReadFileAddr = NULL;
 	readFile = -1;
 	readLen = 0;
 	readSource = 0;
@@ -11971,9 +12164,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_ARCHIVE:
 			case XLOG_FROM_PG_WAL:
 				/* Close any old file we might have open. */
-				if (readFile >= 0)
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 				{
-					close(readFile);
+					do_XLogFileClose(readFile,
+									 mappedReadFileAddr);
+					mappedReadFileAddr = NULL;
 					readFile = -1;
 				}
 				/* Reset curFileTLI if random fetch. */
@@ -11986,8 +12181,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 */
 				readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
-											  currentSource);
-				if (readFile >= 0)
+											  currentSource, &mappedReadFileAddr);
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 					return true;	/* success! */
 
 				/*
@@ -12051,14 +12246,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (readFile < 0 && mappedReadFileAddr == NULL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
 													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+													XLOG_FROM_STREAM, false, &mappedReadFileAddr);
+							Assert(readFile >= 0 || mappedReadFileAddr != NULL);
 						}
 						else
 						{
diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index ca6a0e4f7d..9271153553 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/file
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o
+OBJS = fd.o buffile.o copydir.o reinit.o sharedfileset.o pmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 213de7698a..8b51d7e8a0 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -88,6 +88,7 @@
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -223,6 +224,9 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
+#ifdef USE_LIBPMEM
+	AllocateDescMap,
+#endif
 	AllocateDescPipe,
 	AllocateDescDir,
 	AllocateDescRawFD
@@ -237,6 +241,10 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+#ifdef USE_LIBPMEM
+		size_t		fsize;
+		void	   *addr;
+#endif
 	}			desc;
 } AllocateDesc;
 
@@ -705,14 +713,16 @@ durable_unlink(const char *fname, int elevel)
  * valid upon return.
  */
 int
-durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
+durable_link_or_rename(const char *oldfile, const char *newfile, int elevel,
+					   bool fsync_file)
 {
 	/*
 	 * Ensure that, if we crash directly after the rename/link, a file with
 	 * valid contents is moved into place.
 	 */
-	if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(oldfile, false, false, elevel) != 0)
+			return -1;
 
 #if HAVE_WORKING_LINK
 	if (link(oldfile, newfile) < 0)
@@ -740,8 +750,9 @@ durable_link_or_rename(const char *oldfile, const char *newfile, int elevel)
 	 * Make change persistent in case of an OS crash, both the new entry and
 	 * its parent directory need to be flushed.
 	 */
-	if (fsync_fname_ext(newfile, false, false, elevel) != 0)
-		return -1;
+	if (fsync_file)
+		if (fsync_fname_ext(newfile, false, false, elevel) != 0)
+			return -1;
 
 	/* Same for parent directory */
 	if (fsync_parent_path(newfile, elevel) != 0)
@@ -1556,6 +1567,78 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	return file;
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Mmap a file with MapTransientFilePerm() and pass default file mode for
+ * the fileMode parameter.
+ */
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return MapTransientFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT,
+								fsize, addr);
+}
+
+/*
+ * Like AllocateFile, but returns an unbuffered pointer to the mapped area
+ * like mmap(2)
+ */
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	int			fd;
+
+	DO_DB(elog(LOG, "MapTransientFilePerm: Allocated %d (%s)",
+			   numAllocatedDescs, fileName));
+
+	/* Can we allocate another non-virtual FD? */
+	if (!reserveAllocatedDesc())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
+						maxAllocatedDescs, fileName)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	if (addr != NULL)
+	{
+		void	   *ret_addr = NULL;
+
+		fd = PmemFileOpenPerm(fileName, fileFlags, fileMode, fsize, &ret_addr);
+		if (ret_addr != NULL)
+		{
+			AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+
+			*addr = ret_addr;
+
+			desc->kind = AllocateDescMap;
+			desc->desc.addr = ret_addr;
+			desc->desc.fsize = fsize;
+			desc->create_subid = GetCurrentSubTransactionId();
+			numAllocatedDescs++;
+
+			return fd;
+		}
+	}
+
+	return -1;					/* failure */
+}
+#else
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return -1;
+}
+
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	return -1;
+}
+#endif
 
 /*
  * Create a new file.  The directory containing it must already exist.  Files
@@ -2361,6 +2444,11 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescRawFD:
 			result = close(desc->desc.fd);
 			break;
+#ifdef USE_LIBPMEM
+		case AllocateDescMap:
+			result = PmemFileClose(desc->desc.addr, desc->desc.fsize);
+			break;
+#endif
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -2402,6 +2490,42 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Unmap a file returned by MapTransientFile.
+ *
+ * Note we do not check unmap's return value --- it is up to the caller
+ * to handle unmap errors.
+ */
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "UnmapTransientFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove the mapping from the list of allocated descriptors, if it's present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescMap && desc->desc.addr == addr)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us a file not in allocatedDescs */
+	elog(WARNING, "address passed to UnmapTransientFile was not obtained from MapTransientFile");
+
+	return PmemFileClose(addr, fsize);
+}
+#else
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	return -1;
+}
+#endif
+
 /*
  * Close a file returned by OpenTransientFile.
  *
diff --git a/src/backend/storage/file/pmem.c b/src/backend/storage/file/pmem.c
new file mode 100644
index 0000000000..b214b6b18e
--- /dev/null
+++ b/src/backend/storage/file/pmem.c
@@ -0,0 +1,188 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.c
+ *	  Memory-mapped file code for persistent memory.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/file/pmem.c
+ *
+ * NOTES:
+ *
+ * This code manages a memory-mapped file on a DAX-mounted filesystem on a
+ * persistent memory device, using the Persistent Memory Development Kit
+ * (http://pmem.io/pmdk/).
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/pmem.h"
+#include "storage/fd.h"
+
+#ifdef USE_LIBPMEM
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libpmem.h>
+#include <sys/mman.h>
+#include <string.h>
+
+#define PmemFileSize 32
+
+/*
+ * Returns true only if the file is stored on persistent memory.
+ */
+bool
+CheckPmem(const char *path)
+{
+	int			is_pmem = 0;	/* false */
+	size_t		mapped_len = 0;
+	bool		ret = true;
+	void	   *tmpaddr;
+
+	/*
+	 * is_pmem is set to 0 if the file at path is not stored on persistent
+	 * memory.
+	 */
+	tmpaddr = pmem_map_file(path, PmemFileSize, PMEM_FILE_CREATE,
+							PG_FILE_MODE_DEFAULT, &mapped_len, &is_pmem);
+
+	if (tmpaddr)
+	{
+		pmem_unmap(tmpaddr, mapped_len);
+		unlink(path);
+	}
+
+	if (is_pmem)
+		elog(LOG, "%s is stored on persistent memory.", path);
+	else
+		ret = false;
+
+	return ret;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return PmemFileOpenPerm(pathname, flags, PG_FILE_MODE_DEFAULT, fsize, addr);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	int			mapped_flag = 0;
+	size_t		mapped_len = 0;
+	size_t		size = 0;
+	void	   *ret_addr;
+
+	if (addr == NULL)
+		return BasicOpenFilePerm(pathname, flags, mode);
+
+	/* non-zero 'len' not allowed without PMEM_FILE_CREATE */
+	if (flags & O_CREAT)
+	{
+		mapped_flag = PMEM_FILE_CREATE;
+		size = fsize;
+	}
+
+	if (flags & O_EXCL)
+		mapped_flag |= PMEM_FILE_EXCL;
+
+	ret_addr = pmem_map_file(pathname, size, mapped_flag, mode, &mapped_len,
+							 NULL);
+
+	if (fsize != mapped_len)
+	{
+		if (ret_addr != NULL)
+			pmem_unmap(ret_addr, mapped_len);
+
+		return -1;
+	}
+
+	if (mapped_flag & PMEM_FILE_CREATE)
+		if (msync(ret_addr, mapped_len, MS_SYNC))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not msync log file %s: %m", pathname)));
+
+	*addr = ret_addr;
+
+	return NO_FD_FOR_MAPPED_FILE;
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	pmem_memcpy_nodrain((void *) dest, src, len);
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	memcpy(buf, (void *) map_addr, len);
+}
+
+void
+PmemFileSync(void)
+{
+	pmem_drain();
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	return pmem_unmap((void *) addr, fsize);
+}
+
+
+#else
+bool
+CheckPmem(const char *path)
+{
+	return true;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return BasicOpenFile(pathname, flags);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	return BasicOpenFilePerm(pathname, flags, mode);
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileSync(void)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+	return -1;
+}
+#endif
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 98d75be292..ba26b17b78 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4334,7 +4334,7 @@ static struct config_enum ConfigureNamesEnum[] =
 		},
 		&sync_method,
 		DEFAULT_SYNC_METHOD, sync_method_options,
-		NULL, assign_xlog_sync_method, NULL
+		check_xlog_sync_method, assign_xlog_sync_method, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index a21865a77f..8cd915eb94 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -196,6 +196,7 @@
 					#   fsync
 					#   fsync_writethrough
 					#   open_sync
+					#   pmem_drain
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a9139..033876c3c6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -27,6 +27,7 @@
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
+#define SYNC_METHOD_PMEM_DRAIN	5	/* for Persistent Memory Development Kit */
 extern int	sync_method;
 
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
@@ -259,8 +260,10 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern int XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+			 void **addr);
+extern int	XLogFileOpen(XLogSegNo segno, void **addr);
+extern int	do_XLogFileClose(int fd, void *addr);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -272,6 +275,7 @@ extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern int	xlog_fsync(int fd, void *addr);
 
 extern bool RecoveryInProgress(void);
 extern bool HotStandbyActive(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 74c34757fb..834e3f7353 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -45,6 +45,13 @@
 typedef int File;
 
 
+/*
+ * Default mode for created files, unless something else is specified using
+ * the *Perm() function variants.
+ */
+#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
+
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
@@ -104,6 +111,13 @@ extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern int	CloseTransientFile(int fd);
 
+/* Operations to allow use of a memory-mapped file */
+extern int MapTransientFile(const char *fileName, int fileFlags, size_t fsize,
+				 void **addr);
+extern int MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr);
+extern int	UnmapTransientFile(void *addr, size_t fsize);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
 extern int	BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
@@ -133,7 +147,8 @@ extern void pg_flush_data(int fd, off_t offset, off_t amount);
 extern void fsync_fname(const char *fname, bool isdir);
 extern int	durable_rename(const char *oldfile, const char *newfile, int loglevel);
 extern int	durable_unlink(const char *fname, int loglevel);
-extern int	durable_link_or_rename(const char *oldfile, const char *newfile, int loglevel);
+extern int durable_link_or_rename(const char *oldfile, const char *newfile,
+					   int loglevel, bool fsync_file);
 extern void SyncDataDirectory(void);
 extern int data_sync_elevel(int elevel);
 
diff --git a/src/include/storage/pmem.h b/src/include/storage/pmem.h
new file mode 100644
index 0000000000..b9b9156c91
--- /dev/null
+++ b/src/include/storage/pmem.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.h
+ *		Virtual file descriptor definitions for persistent memory.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/pmem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PMEM_H
+#define PMEM_H
+
+#include "postgres.h"
+
+#define NO_FD_FOR_MAPPED_FILE -2
+
+extern bool CheckPmem(const char *path);
+extern int PmemFileOpen(const char *pathname, int flags, size_t fsize,
+			 void **addr);
+extern int PmemFileOpenPerm(const char *pathname, int flags, int mode,
+				 size_t fsize, void **addr);
+extern void PmemFileWrite(void *dest, void *src, size_t len);
+extern void PmemFileRead(void *map_addr, void *buf, size_t len);
+extern void PmemFileSync(void);
+extern int	PmemFileClose(void *addr, size_t fsize);
+
+#endif							/* PMEM_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index c07e7b945e..436ab961fc 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -433,6 +433,7 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xlog_sync_method(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
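
The recurring pattern in the 0002 patch above is that a WAL segment is identified either by a kernel file descriptor or by a mapped address, and every open, write, sync, and close site branches on which of the two is set. The following is a minimal sketch of that pattern only, not code from the patch. As an assumption made so that it builds without libpmem, it substitutes plain POSIX mmap(), memcpy(), msync(), and munmap() for pmem_map_file(), pmem_memcpy_nodrain(), pmem_drain(), and pmem_unmap(). The SegHandle type and the seg_* function names are hypothetical, and it uses -1 for "no fd" where the patch uses NO_FD_FOR_MAPPED_FILE.

```c
#define _XOPEN_SOURCE 700

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical handle: at most one of fd / addr is in use at a time. */
typedef struct SegHandle
{
	int			fd;		/* -1 when the segment is mapped or closed */
	void	   *addr;	/* NULL when the segment is a plain fd or closed */
	size_t		size;	/* segment size in bytes */
} SegHandle;

/* Open (creating if needed) a fixed-size segment; returns 0 on success. */
static int
seg_open(SegHandle *h, const char *path, size_t size, int want_map)
{
	int			fd = open(path, O_RDWR | O_CREAT, 0600);
	void	   *p;

	h->fd = -1;
	h->addr = NULL;
	h->size = size;
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) size) != 0)
	{
		close(fd);
		return -1;
	}
	if (!want_map)
	{
		h->fd = fd;
		return 0;
	}
	/* Stand-in for pmem_map_file(): the mapping outlives the fd. */
	p = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return -1;
	h->addr = p;
	return 0;
}

/* Same test the patch repeats: "fd >= 0 || addr != NULL". */
static int
seg_is_open(const SegHandle *h)
{
	return h->fd >= 0 || h->addr != NULL;
}

/* Write len bytes at offset off; memcpy for a mapping, pwrite for an fd. */
static int
seg_write(SegHandle *h, size_t off, const void *buf, size_t len)
{
	assert(seg_is_open(h));
	if (off > h->size || len > h->size - off)
		return -1;				/* would run past the end of the segment */
	if (h->addr != NULL)
	{
		memcpy((char *) h->addr + off, buf, len);
		return 0;
	}
	return pwrite(h->fd, buf, len, (off_t) off) == (ssize_t) len ? 0 : -1;
}

/* Make earlier writes durable; msync stands in for pmem_drain(). */
static int
seg_sync(SegHandle *h)
{
	if (h->addr != NULL)
		return msync(h->addr, h->size, MS_SYNC);
	return fsync(h->fd);
}

/* Close either kind of handle and reset both fields. */
static int
seg_close(SegHandle *h)
{
	int			rc = h->addr != NULL ? munmap(h->addr, h->size) : close(h->fd);

	h->addr = NULL;
	h->fd = -1;
	return rc;
}
```

Keeping both fields in one handle and resetting them together in seg_close() is one way to avoid the paired "x = NULL; y = -1" assignments that the patch has to repeat at each call site.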
0003-Walreceiver-WAL-IO-using-PMDK-v3.patch (application/octet-stream)
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..044fe078a8 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -61,6 +61,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
 #include "utils/builtins.h"
@@ -91,6 +92,7 @@ static int	recvFile = -1;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
 static uint32 recvOff = 0;
+void	   *mappedFileAddr = NULL;
 
 /*
  * Flags set by interrupt handlers of walreceiver for later service in the
@@ -599,12 +601,12 @@ WalReceiverMain(void)
 		 * End of WAL reached on the requested timeline. Close the last
 		 * segment, and await for new orders from the startup process.
 		 */
-		if (recvFile >= 0)
+		if (recvFile >= 0 || mappedFileAddr != NULL)
 		{
 			char		xlogfname[MAXFNAMELEN];
 
 			XLogWalRcvFlush(false);
-			if (close(recvFile) != 0)
+			if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -621,6 +623,7 @@ WalReceiverMain(void)
 				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
+		mappedFileAddr = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -931,7 +934,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if ((recvFile < 0 && mappedFileAddr == NULL) ||
+			!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
 			bool		use_existent;
 
@@ -939,7 +943,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
 			 */
-			if (recvFile >= 0)
+			if (recvFile >= 0 || mappedFileAddr != NULL)
 			{
 				char		xlogfname[MAXFNAMELEN];
 
@@ -950,7 +954,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -967,11 +971,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
+			mappedFileAddr = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true, &mappedFileAddr);
 			recvFileTLI = ThisTimeLineID;
 			recvOff = 0;
 		}
@@ -987,30 +992,39 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		/* Need to seek in the file? */
 		if (recvOff != startoff)
 		{
-			if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
-				ereport(PANIC,
-						(errcode_for_file_access(),
-						 errmsg("could not seek in log segment %s to offset %u: %m",
-								XLogFileNameP(recvFileTLI, recvSegNo),
-								startoff)));
+			if (!mappedFileAddr)
+				if (lseek(recvFile, (off_t) startoff, SEEK_SET) < 0)
+					ereport(PANIC,
+							(errcode_for_file_access(),
+							 errmsg("could not seek in log segment %s to offset %u: %m",
+									XLogFileNameP(recvFileTLI, recvSegNo),
+									startoff)));
 			recvOff = startoff;
 		}
 
-		/* OK to write the logs */
-		errno = 0;
+		if (mappedFileAddr)
+		{
+			PmemFileWrite((char *) mappedFileAddr + startoff, buf, segbytes);
+			byteswritten = segbytes;
+		}
+		else
+		{
+			/* OK to write the logs */
+			errno = 0;
 
-		byteswritten = write(recvFile, buf, segbytes);
-		if (byteswritten <= 0)
-		{
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							XLogFileNameP(recvFileTLI, recvSegNo),
-							recvOff, (unsigned long) segbytes)));
+			byteswritten = write(recvFile, buf, segbytes);
+			if (byteswritten <= 0)
+			{
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+								"at offset %u, length %lu: %m",
+								XLogFileNameP(recvFileTLI, recvSegNo),
+								recvOff, (unsigned long) segbytes)));
+			}
 		}
 
 		/* Update state for write */
#36Peter Eisentraut
peter.eisentraut@2ndquadrant.com
In reply to: Takashi Menjo (#35)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

On 30/01/2019 07:16, Takashi Menjo wrote:

Sorry but I found that the patchset v2 had a bug in managing WAL segment
file offset. I fixed it and updated a patchset as v3 (attached).

I'm concerned with how this would affect the future maintenance of this
code. You are introducing a whole separate code path for PMDK beside
the normal file path (and it doesn't seem very well separated either).
Now everyone who wants to do some surgery in the WAL code needs to take
that into account. And everyone who wants to do performance work in the
WAL code needs to check that the PMDK path doesn't regress. AFAICT,
this hardware isn't very popular at the moment, so it would be very hard
to peer review any work in this area.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#37Takashi Menjo
menjo.takashi@lab.ntt.co.jp
In reply to: Peter Eisentraut (#36)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Peter Eisentraut wrote:

I'm concerned with how this would affect the future maintenance of this
code. You are introducing a whole separate code path for PMDK beside
the normal file path (and it doesn't seem very well separated either).
Now everyone who wants to do some surgery in the WAL code needs to take
that into account. And everyone who wants to do performance work in the
WAL code needs to check that the PMDK path doesn't regress. AFAICT,
this hardware isn't very popular at the moment, so it would be very hard
to peer review any work in this area.

Thank you for your comment. Your concern about maintainability is
reasonable, and our patchset still falls short there. I will take that
into account when I submit the next update. (It may take a long time, so
please be patient...)

Regards,
Takashi

--
Takashi Menjo - NTT Software Innovation Center
<menjo.takashi@lab.ntt.co.jp>

#38Takashi Menjo
takashi.menjo@gmail.com
In reply to: Takashi Menjo (#37)
3 attachment(s)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Dear hackers,

I rebased my old patchset. It would be good to compare this v4 patchset with
the non-volatile WAL buffer patchset [1].

[1]: /messages/by-id/002101d649fb$1f5966e0$5e0c34a0$@hco.ntt.co.jp_1

Regards,
Takashi

--
Takashi Menjo <takashi.menjo@gmail.com>

Attachments:

v4-0002-Read-write-WAL-files-using-PMDK.patch (application/octet-stream)
From 093e6c03c8413f2d36f3d28be89b5a93647795ba Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 4 Aug 2020 13:02:14 +0900
Subject: [PATCH v4 2/3] Read write WAL files using PMDK

Author: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
---
 src/backend/access/transam/xlog.c             | 461 ++++++++++++------
 src/backend/storage/file/Makefile             |   3 +-
 src/backend/storage/file/fd.c                 | 121 +++++
 src/backend/storage/file/pmem.c               | 188 +++++++
 src/backend/utils/misc/guc.c                  |   2 +-
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h                     |   8 +-
 src/include/storage/fd.h                      |  13 +
 src/include/storage/pmem.h                    |  32 ++
 src/include/utils/guc.h                       |   1 +
 10 files changed, 685 insertions(+), 145 deletions(-)
 create mode 100644 src/backend/storage/file/pmem.c
 create mode 100644 src/include/storage/pmem.h
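
The pattern that recurs throughout the diff below is a segment that is open
either as a file descriptor or as a mapped address, with every call site
branching on which one it holds. The following is a minimal sketch of that
pattern outside PostgreSQL, not the patch's implementation. SegHandle and the
seg_* functions are hypothetical names, and plain POSIX mmap/memcpy/msync
stand in for the PMDK-backed PmemFile* helpers so that it builds without
libpmem.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical handle: exactly one of fd >= 0 or addr != NULL is set,
 * mirroring the (openLogFile, mappedLogFileAddr) pair in the patch.
 */
typedef struct SegHandle
{
	int			fd;
	void	   *addr;
	size_t		size;
} SegHandle;

/* Open (creating if needed) a fixed-size segment, mapped or not. */
static int
seg_open(SegHandle *h, const char *path, size_t size, int use_map)
{
	int			fd = open(path, O_RDWR | O_CREAT, 0600);

	h->fd = -1;
	h->addr = NULL;
	h->size = size;
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t) size) != 0)
	{
		close(fd);
		return -1;
	}
	if (use_map)
	{
		void	   *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
							 MAP_SHARED, fd, 0);

		close(fd);				/* the mapping outlives the descriptor */
		if (p == MAP_FAILED)
			return -1;
		h->addr = p;
	}
	else
		h->fd = fd;
	return 0;
}

/* Write n bytes at offset off: memcpy when mapped, pwrite otherwise. */
static ssize_t
seg_write(SegHandle *h, const void *buf, size_t n, off_t off)
{
	assert(h->addr != NULL || h->fd >= 0);
	if (h->addr != NULL)
	{
		assert((size_t) off + n <= h->size);
		memcpy((char *) h->addr + off, buf, n);
		return (ssize_t) n;
	}
	return pwrite(h->fd, buf, n, off);
}

/* Make prior writes durable: msync when mapped, fsync otherwise. */
static int
seg_sync(SegHandle *h)
{
	if (h->addr != NULL)
		return msync(h->addr, h->size, MS_SYNC);
	return fsync(h->fd);
}

/* Release whichever resource the handle holds and reset it. */
static int
seg_close(SegHandle *h)
{
	int			rc = (h->addr != NULL) ? munmap(h->addr, h->size)
									   : close(h->fd);

	h->addr = NULL;
	h->fd = -1;
	return rc;
}
```

A caller would run seg_open(), seg_write(), seg_sync() and seg_close() in that
order and never test fd or addr itself. Keeping the branch inside four small
functions is one possible way to limit how far the second code path spreads.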

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 184c6672f3..ad50012138 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -63,6 +63,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -148,6 +149,9 @@ const struct config_enum_entry sync_method_options[] = {
 #endif
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
+#endif
+#ifdef USE_LIBPMEM
+	{"pmem_drain", SYNC_METHOD_PMEM_DRAIN, false},
 #endif
 	{NULL, 0, false}
 };
@@ -799,6 +803,7 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static void *mappedLogFileAddr = NULL;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -816,6 +821,7 @@ static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static XLogSource readSource = XLOG_FROM_ANY;
+static void *mappedReadFileAddr = NULL;
 
 /*
  * Keeps track of which source we're currently reading from. This is
@@ -905,13 +911,15 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
+static int	do_XLogFileOpen(char *pathname, int flags, void **addr);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 								   bool find_free, XLogSegNo max_segno,
 								   bool use_lock);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-						 XLogSource source, bool notfoundOk);
-static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+						 XLogSource source, bool notfoundOk, void **addr);
+static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source,
+							   void **addr);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -2399,6 +2407,15 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }
 
+static int
+do_XLogFileOpen(char *pathname, int flags, void **addr)
+{
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		return PmemFileOpen(pathname, flags, wal_segment_size, addr);
+	else
+		return BasicOpenFile(pathname, flags);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2478,24 +2495,27 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
-			ReserveExternalFD();
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true,
+									   &mappedLogFileAddr);
+			if (openLogFile >= 0)
+				ReserveExternalFD();
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile < 0 && mappedLogFileAddr == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
-			ReserveExternalFD();
+			openLogFile = XLogFileOpen(openLogSegNo, &mappedLogFileAddr);
+			if (openLogFile >= 0)
+				ReserveExternalFD();
 		}
 
 		/* Add current page to the set of pending pages-to-dump */
@@ -2531,35 +2551,49 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
+
+			if (mappedLogFileAddr != NULL)
 			{
-				errno = 0;
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				PmemFileWrite((char *) mappedLogFileAddr + startoffset, from, nbytes);
 				pgstat_report_wait_end();
-				if (written <= 0)
+
+				written = nbytes;
+				nleft = 0;
+				from += nbytes;
+			}
+			else
+			{
+				nleft = nbytes;
+				do
 				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
+					errno = 0;
+					pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+					written = pg_pwrite(openLogFile, from, nleft, startoffset);
+					pgstat_report_wait_end();
+					if (written <= 0)
+					{
+						char		xlogfname[MAXFNAMELEN];
+						int			save_errno;
 
-					if (errno == EINTR)
-						continue;
+						if (errno == EINTR)
+							continue;
 
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+						save_errno = errno;
+						XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+									 wal_segment_size);
+						errno = save_errno;
+						ereport(PANIC,
+								(errcode_for_file_access(),
+								 errmsg("could not write to log file %s "
+										"at offset %u, length %zu: %m",
+										xlogfname, startoffset, nleft)));
+					}
+					nleft -= written;
+					from += written;
+					startoffset += written;
+				} while (nleft > 0);
+			}
 
 			npages = 0;
 
@@ -2637,16 +2671,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if ((openLogFile >= 0 || mappedLogFileAddr != NULL) &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile < 0 && mappedLogFileAddr == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
-				ReserveExternalFD();
+				openLogFile = XLogFileOpen(openLogSegNo, &mappedLogFileAddr);
+				if (openLogFile >= 0)
+					ReserveExternalFD();
 			}
 
 			issue_xlog_fsync(openLogFile, openLogSegNo);
@@ -3070,7 +3105,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3251,7 +3286,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+			 void **addr)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3261,6 +3297,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	int			fd;
 	int			nbytes;
 	int			save_errno;
+	void	   *tmpaddr = NULL;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3269,8 +3306,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		fd = do_XLogFileOpen(path,
+							 O_RDWR | PG_BINARY | get_sync_bit(sync_method),
+							 &tmpaddr);
+		if (fd < 0 && tmpaddr == NULL)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
@@ -3278,7 +3317,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
 		else
+		{
+			*addr = tmpaddr;
 			return fd;
+		}
 	}
 
 	/*
@@ -3294,8 +3336,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	fd = do_XLogFileOpen(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						 &tmpaddr);
+	if (fd < 0 && tmpaddr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3316,29 +3359,41 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
 		for (nbytes = 0; nbytes < wal_segment_size; nbytes += XLOG_BLCKSZ)
+		{
+			if (tmpaddr != NULL)
+				PmemFileWrite((char *) tmpaddr + nbytes, zbuffer.data,
+							  XLOG_BLCKSZ);
+			else
+			{
+				errno = 0;
+				if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+				{
+					/* if write didn't set errno, assume no disk space */
+					save_errno = errno ? errno : ENOSPC;
+					break;
+				}
+			}
+		}
+	}
+	else
+	{
+		/*
+		 * Otherwise, seeking to the end and writing a solitary byte is
+		 * enough.
+		 */
+		if (tmpaddr != NULL)
+			PmemFileWrite((char *) tmpaddr + wal_segment_size - 1,
+						  zbuffer.data, 1);
+		else
 		{
 			errno = 0;
-			if (write(fd, zbuffer.data, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+			if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
 			{
 				/* if write didn't set errno, assume no disk space */
 				save_errno = errno ? errno : ENOSPC;
-				break;
 			}
 		}
 	}
-	else
-	{
-		/*
-		 * Otherwise, seeking to the end and writing a solitary byte is
-		 * enough.
-		 */
-		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
-		{
-			/* if write didn't set errno, assume no disk space */
-			save_errno = errno ? errno : ENOSPC;
-		}
-	}
 	pgstat_report_wait_end();
 
 	if (save_errno)
@@ -3358,11 +3413,11 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, tmpaddr) != 0)
 	{
 		int			save_errno = errno;
 
-		close(fd);
+		do_XLogFileClose(fd, tmpaddr);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3370,7 +3425,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd) != 0)
+	if (do_XLogFileClose(fd, tmpaddr))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3411,8 +3466,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3447,13 +3503,20 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	int			srcfd;
 	int			fd;
 	int			nbytes;
+	void	   *src_addr = NULL;
+	void	   *dst_addr = NULL;
 
 	/*
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno, wal_segment_size);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
+	srcfd = -1;
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		srcfd = MapTransientFile(path, O_RDONLY | PG_BINARY,
+								 wal_segment_size, &src_addr);
+	if (src_addr == NULL)
+		srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (srcfd < 0 && src_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3466,8 +3529,15 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	if (src_addr != NULL && sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fd = MapTransientFile(tmppath,
+							  O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  wal_segment_size, &dst_addr);
+	else
+		fd = OpenTransientFile(tmppath,
+							   O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	if (fd < 0 && dst_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3475,6 +3545,15 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Do the data copying.
 	 */
+	if (src_addr && dst_addr)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+		PmemFileWrite(dst_addr, src_addr, wal_segment_size);
+		pgstat_report_wait_end();
+
+		goto done_copy;
+	}
+
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += sizeof(buffer))
 	{
 		int			nread;
@@ -3531,14 +3610,22 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 		pgstat_report_wait_end();
 	}
 
+done_copy:
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, dst_addr) != 0)
 		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd) != 0)
+	if (dst_addr)
+	{
+		if (UnmapTransientFile(dst_addr, wal_segment_size))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not unmap file \"%s\": %m", tmppath)));
+	}
+	else if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3547,6 +3634,13 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", path)));
+	if (src_addr)
+		UnmapTransientFile(src_addr, wal_segment_size);
+	else
+		if (CloseTransientFile(srcfd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close file \"%s\": %m", path)));
 
 	/*
 	 * Now move the segment into place with its final name.
@@ -3643,15 +3737,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(XLogSegNo segno)
+XLogFileOpen(XLogSegNo segno, void **addr)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3667,7 +3762,7 @@ XLogFileOpen(XLogSegNo segno)
  */
 static int
 XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 XLogSource source, bool notfoundOk)
+			 XLogSource source, bool notfoundOk, void **addr)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
@@ -3716,8 +3811,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 	}
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-	if (fd >= 0)
+	fd = do_XLogFileOpen(path, O_RDONLY | PG_BINARY, addr);
+	if (fd >= 0 || *addr != NULL)
 	{
 		/* Success! */
 		curFileTLI = tli;
@@ -3749,7 +3844,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLEs.
  */
 static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source, void **addr)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -3814,8 +3909,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_ARCHIVE, true);
-			if (fd != -1)
+							  XLOG_FROM_ARCHIVE, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
@@ -3827,8 +3922,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_PG_WAL, true);
-			if (fd != -1)
+							  XLOG_FROM_PG_WAL, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3846,13 +3941,22 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 	return -1;
 }
 
+int
+do_XLogFileClose(int fd, void *addr)
+{
+	if (!addr)
+		return close(fd);
+
+	return PmemFileClose(addr, wal_segment_size);
+}
+
 /*
  * Close the current logfile segment for writing.
  */
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile >= 0 || mappedLogFileAddr != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3861,11 +3965,11 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && openLogFile >= 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile) != 0)
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 	{
 		char		xlogfname[MAXFNAMELEN];
 		int			save_errno = errno;
@@ -3877,8 +3981,12 @@ XLogFileClose(void)
 				 errmsg("could not close file \"%s\": %m", xlogfname)));
 	}
 
-	openLogFile = -1;
-	ReleaseExternalFD();
+	mappedLogFileAddr = NULL;
+	if (openLogFile >= 0)
+	{
+		openLogFile = -1;
+		ReleaseExternalFD();
+	}
 }
 
 /*
@@ -3897,6 +4005,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
+	void	   *laddr = NULL;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3905,8 +4014,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		lf = XLogFileInit(_logSegNo, &use_existent, true, &laddr);
+		do_XLogFileClose(lf, laddr);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -4349,9 +4458,10 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			if (readFile >= 0)
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
 			{
-				close(readFile);
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+				mappedReadFileAddr = NULL;
 				readFile = -1;
 			}
 
@@ -5299,7 +5409,7 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false, &mappedLogFileAddr);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5308,30 +5418,39 @@ BootStrapXLOG(void)
 
 	/* Write the first page with the initial record */
 	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	if (mappedLogFileAddr != NULL)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		PmemFileWrite(mappedLogFileAddr, page, XLOG_BLCKSZ);
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
 	}
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile) != 0)
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 
 	/* Now create pg_control */
@@ -5566,9 +5685,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * If the ending log segment is still open, close it (to avoid problems on
 	 * Windows with trying to rename or delete an open file).
 	 */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 
@@ -5607,10 +5727,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 */
 		bool		use_existent = true;
 		int			fd;
+		void	   *tmpaddr = NULL;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent, true, &tmpaddr);
 
-		if (close(fd) != 0)
+		if (do_XLogFileClose(fd, tmpaddr))
 		{
 			char		xlogfname[MAXFNAMELEN];
 			int			save_errno = errno;
@@ -7899,9 +8020,10 @@ StartupXLOG(void)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 	XLogReaderFree(xlogreader);
@@ -10341,6 +10463,9 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+#endif
 			return 0;
 #ifdef OPEN_SYNC_FLAG
 		case SYNC_METHOD_OPEN:
@@ -10358,7 +10483,36 @@ get_sync_bit(int method)
 }
 
 /*
- * GUC support
+ * GUC check_hook for xlog_sync_method
+ */
+bool
+check_xlog_sync_method(int *newval, void **extra, GucSource source)
+{
+	bool		ret;
+	char		tmppath[MAXPGPATH] = {};
+	int			val = newval ? *newval : sync_method;
+
+	if (val != SYNC_METHOD_PMEM_DRAIN)
+		return true;
+
+	snprintf(tmppath, MAXPGPATH, "%s/" XLOGDIR "/pmem.tmp.%d", DataDir, (int) getpid());
+
+	ret = CheckPmem(tmppath);
+
+	if (!ret)
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for parameter \"wal_sync_method\": \"pmem_drain\"");
+		GUC_check_errdetail("%s isn't stored on persistent memory(pmem_is_pmem() returned false).",
+							XLOGDIR);
+		GUC_check_errhint("Please see also ENVIRONMENT VARIABLES section in man libpmem.");
+	}
+
+	return ret;
+}
+
+/*
+ * GUC assign_hook for xlog_sync_method
  */
 void
 assign_xlog_sync_method(int new_sync_method, void *extra)
@@ -10371,10 +10525,10 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 			{
 				char		xlogfname[MAXFNAMELEN];
 				int			save_errno;
@@ -10425,6 +10579,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 			if (pg_fdatasync(fd) != 0)
 				msg = _("could not fdatasync file \"%s\": %m");
 			break;
+#endif
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+			PmemFileSync();
+			break;
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
@@ -10452,6 +10611,16 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	pgstat_report_wait_end();
 }
 
+int
+xlog_fsync(int fd, void *addr)
+{
+	if (!addr)
+		return pg_fsync(fd);
+
+	PmemFileSync();
+	return 0;
+}
+
 /*
  * do_pg_start_backup
  *
@@ -11887,7 +12056,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || mappedReadFileAddr != NULL) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -11904,7 +12073,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
 	}
@@ -11913,7 +12083,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readFile < 0 && mappedReadFileAddr == NULL) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
@@ -11922,8 +12092,9 @@ retry:
 										 private->fetching_ckpt,
 										 targetRecPtr))
 		{
-			if (readFile >= 0)
-				close(readFile);
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+			mappedReadFileAddr = NULL;
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
@@ -11936,7 +12107,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || mappedReadFileAddr != NULL);
 
 	/*
 	 * If the current segment is being streamed from the primary, calculate how
@@ -11959,28 +12130,33 @@ retry:
 	readOff = targetPageOff;
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (mappedReadFileAddr)
+		PmemFileRead((char *) mappedReadFileAddr + readOff, readBuf, XLOG_BLCKSZ);
+	else
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
 	}
 	pgstat_report_wait_end();
 
@@ -12028,8 +12204,9 @@ retry:
 next_record_is_invalid:
 	lastSourceFailed = true;
 
-	if (readFile >= 0)
-		close(readFile);
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+	mappedReadFileAddr = NULL;
 	readFile = -1;
 	readLen = 0;
 	readSource = XLOG_FROM_ANY;
@@ -12269,9 +12446,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				Assert(!WalRcvStreaming());
 
 				/* Close any old file we might have open. */
-				if (readFile >= 0)
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 				{
-					close(readFile);
+					do_XLogFileClose(readFile,
+									 mappedReadFileAddr);
+					mappedReadFileAddr = NULL;
 					readFile = -1;
 				}
 				/* Reset curFileTLI if random fetch. */
@@ -12284,8 +12463,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 */
 				readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
-											  currentSource);
-				if (readFile >= 0)
+											  currentSource, &mappedReadFileAddr);
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 					return true;	/* success! */
 
 				/*
@@ -12419,14 +12598,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (readFile < 0 && mappedReadFileAddr == NULL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
 													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+													XLOG_FROM_STREAM, false, &mappedReadFileAddr);
+							Assert(readFile >= 0 || mappedReadFileAddr != NULL);
 						}
 						else
 						{
diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index 5e1291bf2d..462c71bb03 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -17,6 +17,7 @@ OBJS = \
 	copydir.o \
 	fd.o \
 	reinit.o \
-	sharedfileset.o
+	sharedfileset.o \
+	pmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index 5f6420efb2..3281cf146f 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -94,6 +94,7 @@
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -231,6 +232,9 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
+#ifdef USE_LIBPMEM
+	AllocateDescMap,
+#endif
 	AllocateDescPipe,
 	AllocateDescDir,
 	AllocateDescRawFD
@@ -245,6 +249,10 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+#ifdef USE_LIBPMEM
+		size_t		fsize;
+		void	   *addr;
+#endif
 	}			desc;
 } AllocateDesc;
 
@@ -1695,6 +1703,78 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	return file;
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Mmap a file with MapTransientFilePerm() and pass default file mode for
+ * the fileMode parameter.
+ */
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return MapTransientFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT,
+								fsize, addr);
+}
+
+/*
+ * Like AllocateFile, but returns an unbuffered pointer to the mapped area
+ * like mmap(2)
+ */
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	int			fd;
+
+	DO_DB(elog(LOG, "MapTransientFilePerm: Allocated %d (%s)",
+			   numAllocatedDescs, fileName));
+
+	/* Can we allocate another non-virtual FD? */
+	if (!reserveAllocatedDesc())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
+						maxAllocatedDescs, fileName)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	if (addr != NULL)
+	{
+		void	   *ret_addr = NULL;
+
+		fd = PmemFileOpenPerm(fileName, fileFlags, fileMode, fsize, &ret_addr);
+		if (ret_addr != NULL)
+		{
+			AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+
+			*addr = ret_addr;
+
+			desc->kind = AllocateDescMap;
+			desc->desc.addr = ret_addr;
+			desc->desc.fsize = fsize;
+			desc->create_subid = GetCurrentSubTransactionId();
+			numAllocatedDescs++;
+
+			return fd;
+		}
+	}
+
+	return -1;					/* failure */
+}
+#else
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return -1;
+}
+
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	return -1;
+}
+#endif
 
 /*
  * Create a new file.  The directory containing it must already exist.  Files
@@ -2498,6 +2578,11 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescRawFD:
 			result = close(desc->desc.fd);
 			break;
+#ifdef USE_LIBPMEM
+		case AllocateDescMap:
+			result = PmemFileClose(desc->desc.addr, desc->desc.fsize);
+			break;
+#endif
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -2539,6 +2624,42 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Unmap a file returned by MapTransientFile.
+ *
+ * Note we do not check unmap's return value --- it is up to the caller
+ * to handle unmap errors.
+ */
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "UnmapTransientFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove the mapping from the list of allocated descriptors, if present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescMap && desc->desc.addr == addr)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us an address not in allocatedDescs */
+	elog(WARNING, "address passed to UnmapTransientFile was not obtained from MapTransientFile");
+
+	return PmemFileClose(addr, fsize);
+}
+#else
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	return -1;
+}
+#endif
+
 /*
  * Close a file returned by OpenTransientFile.
  *
diff --git a/src/backend/storage/file/pmem.c b/src/backend/storage/file/pmem.c
new file mode 100644
index 0000000000..b214b6b18e
--- /dev/null
+++ b/src/backend/storage/file/pmem.c
@@ -0,0 +1,188 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.c
+ *	  Memory-mapped file access code for persistent memory.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/file/pmem.c
+ *
+ * NOTES:
+ *
+ * This code manages a memory-mapped file on a filesystem mounted with DAX on
+ * a persistent memory device using the Persistent Memory Development Kit
+ * (http://pmem.io/pmdk/).
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/pmem.h"
+#include "storage/fd.h"
+
+#ifdef USE_LIBPMEM
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libpmem.h>
+#include <sys/mman.h>
+#include <string.h>
+
+#define PmemFileSize 32
+
+/*
+ * Returns true only if the file is stored on persistent memory.
+ */
+bool
+CheckPmem(const char *path)
+{
+	int			is_pmem = 0;	/* false */
+	size_t		mapped_len = 0;
+	bool		ret = true;
+	void	   *tmpaddr;
+
+	/*
+	 * is_pmem stays 0 if the file at 'path' is not stored on persistent
+	 * memory, or if the mapping attempt fails.
+	 */
+	tmpaddr = pmem_map_file(path, PmemFileSize, PMEM_FILE_CREATE,
+							PG_FILE_MODE_DEFAULT, &mapped_len, &is_pmem);
+
+	if (tmpaddr)
+	{
+		pmem_unmap(tmpaddr, mapped_len);
+		unlink(path);
+	}
+
+	if (is_pmem)
+		elog(LOG, "%s is stored on persistent memory.", path);
+	else
+		ret = false;
+
+	return ret;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return PmemFileOpenPerm(pathname, flags, PG_FILE_MODE_DEFAULT, fsize, addr);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	int			mapped_flag = 0;
+	size_t		mapped_len = 0;
+	size_t		size = 0;
+	void	   *ret_addr;
+
+	if (addr == NULL)
+		return BasicOpenFile(pathname, flags);
+
+	/* non-zero 'len' not allowed without PMEM_FILE_CREATE */
+	if (flags & O_CREAT)
+	{
+		mapped_flag = PMEM_FILE_CREATE;
+		size = fsize;
+	}
+
+	if (flags & O_EXCL)
+		mapped_flag |= PMEM_FILE_EXCL;
+
+	ret_addr = pmem_map_file(pathname, size, mapped_flag, mode, &mapped_len,
+							 NULL);
+
+	if (fsize != mapped_len)
+	{
+		if (ret_addr != NULL)
+			pmem_unmap(ret_addr, mapped_len);
+
+		return -1;
+	}
+
+	if (mapped_flag & PMEM_FILE_CREATE)
+		if (msync(ret_addr, mapped_len, MS_SYNC))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not msync log file %s: %m", pathname)));
+
+	*addr = ret_addr;
+
+	return NO_FD_FOR_MAPPED_FILE;
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	pmem_memcpy_nodrain((void *) dest, src, len);
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	memcpy(buf, (void *) map_addr, len);
+}
+
+void
+PmemFileSync(void)
+{
+	pmem_drain();
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	return pmem_unmap((void *) addr, fsize);
+}
+
+
+#else
+bool
+CheckPmem(const char *path)
+{
+	return true;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return BasicOpenFile(pathname, flags);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	return BasicOpenFilePerm(pathname, flags, mode);
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+void
+PmemFileSync(void)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	ereport(PANIC, (errmsg("don't have the pmem device")));
+	return -1;
+}
+#endif
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6f603cbbe8..47eb89f885 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4680,7 +4680,7 @@ static struct config_enum ConfigureNamesEnum[] =
 		},
 		&sync_method,
 		DEFAULT_SYNC_METHOD, sync_method_options,
-		NULL, assign_xlog_sync_method, NULL
+		check_xlog_sync_method, assign_xlog_sync_method, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 5a0b8e9821..eeb5ba3a0e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -208,6 +208,7 @@
 					#   fsync
 					#   fsync_writethrough
 					#   open_sync
+					#   pmem_drain
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 219a7299e1..278a4a1dcf 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -27,6 +27,7 @@
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
+#define SYNC_METHOD_PMEM_DRAIN	5	/* for Persistent Memory Development Kit */
 extern int	sync_method;
 
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
@@ -287,8 +288,10 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern int XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+						void **addr);
+extern int	XLogFileOpen(XLogSegNo segno, void **addr);
+extern int	do_XLogFileClose(int fd, void *addr);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -300,6 +303,7 @@ extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern int	xlog_fsync(int fd, void *addr);
 
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 8cd125d7df..c3ec6ecbb3 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -49,6 +49,12 @@
 typedef int File;
 
 
+/*
+ * Default mode for created files, unless something else is specified using
+ * the *Perm() function variants.
+ */
+#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
@@ -120,6 +126,13 @@ extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern int	CloseTransientFile(int fd);
 
+/* Operations to allow use of a memory-mapped file */
+extern int MapTransientFile(const char *fileName, int fileFlags, size_t fsize,
+				 void **addr);
+extern int MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr);
+extern int	UnmapTransientFile(void *addr, size_t fsize);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
 extern int	BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
diff --git a/src/include/storage/pmem.h b/src/include/storage/pmem.h
new file mode 100644
index 0000000000..b9b9156c91
--- /dev/null
+++ b/src/include/storage/pmem.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.h
+ *		Virtual file descriptor definitions for persistent memory.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/pmem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PMEM_H
+#define PMEM_H
+
+#include "postgres.h"
+
+#define NO_FD_FOR_MAPPED_FILE -2
+
+extern bool CheckPmem(const char *path);
+extern int PmemFileOpen(const char *pathname, int flags, size_t fsize,
+			 void **addr);
+extern int PmemFileOpenPerm(const char *pathname, int flags, int mode,
+				 size_t fsize, void **addr);
+extern void PmemFileWrite(void *dest, void *src, size_t len);
+extern void PmemFileRead(void *map_addr, void *buf, size_t len);
+extern void PmemFileSync(void);
+extern int	PmemFileClose(void *addr, size_t fsize);
+
+#endif							/* PMEM_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 2819282181..802d281245 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -438,6 +438,7 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xlog_sync_method(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.25.1
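
The XLogPageRead() hunks above choose between two read paths: a plain memory copy when the segment is mapped, and pg_pread() on a file descriptor otherwise. A minimal sketch of that dispatch follows. It is an illustration only, not code from the patch: every name starting with sketch_ or SKETCH_ is hypothetical, and POSIX mmap() on an ordinary temporary file stands in for pmem_map_file() on a DAX filesystem so that the example runs without PMDK.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_BLCKSZ 8192		/* hypothetical stand-in for XLOG_BLCKSZ */

/*
 * Read one block at 'off'.  If 'mapped' is non-NULL the segment is memory
 * mapped and a plain memcpy suffices; otherwise fall back to pread() on 'fd'.
 * Returns 0 on success, -1 on a short or failed read.
 */
int
sketch_read_block(int fd, const void *mapped, off_t off, char *buf)
{
	ssize_t		r;

	assert(buf != NULL);

	if (mapped != NULL)
	{
		memcpy(buf, (const char *) mapped + off, SKETCH_BLCKSZ);
		return 0;
	}

	r = pread(fd, buf, SKETCH_BLCKSZ, off);
	return (r == SKETCH_BLCKSZ) ? 0 : -1;
}

/*
 * Write a two-block temporary file, then read the second block back through
 * both paths and check that they agree.  Returns 0 if they match.
 */
int
sketch_read_demo(void)
{
	char		path[] = "/tmp/sketch_wal_XXXXXX";
	char		block[SKETCH_BLCKSZ];
	char		via_fd[SKETCH_BLCKSZ];
	char		via_map[SKETCH_BLCKSZ];
	size_t		len = 2 * SKETCH_BLCKSZ;
	void	   *addr;
	int			fd;
	int			rc = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);				/* file disappears once fd is closed */

	memset(block, 'A', sizeof(block));
	if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
		goto done;
	memset(block, 'B', sizeof(block));
	if (write(fd, block, sizeof(block)) != (ssize_t) sizeof(block))
		goto done;

	/* Stand-in for pmem_map_file(): an ordinary shared mapping. */
	addr = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
	if (addr == MAP_FAILED)
		goto done;

	if (sketch_read_block(fd, NULL, SKETCH_BLCKSZ, via_fd) == 0 &&
		sketch_read_block(-1, addr, SKETCH_BLCKSZ, via_map) == 0 &&
		memcmp(via_fd, via_map, SKETCH_BLCKSZ) == 0 &&
		via_map[0] == 'B')
		rc = 0;

	munmap(addr, len);

done:
	close(fd);
	return rc;
}
```

Calling sketch_read_demo() from a main() exercises both branches. Note that the mapped branch has no short-read case to report, which is why the patch moves the existing error handling under the else branch.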

v4-0001-Add-configure-option-for-PMDK.patch
From 71fe47b5d2b4edc4b738f9764e014082323be6aa Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 4 Aug 2020 12:59:29 +0900
Subject: [PATCH v4 1/3] Add configure option for PMDK

Author: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
---
 configure                  | 92 ++++++++++++++++++++++++++++++++++++++
 configure.ac               | 16 +++++++
 src/include/pg_config.h.in |  6 +++
 3 files changed, 114 insertions(+)

diff --git a/configure b/configure
index cb8fbe1051..8795969c81 100755
--- a/configure
+++ b/configure
@@ -701,6 +701,7 @@ LDFLAGS_SL
 LDFLAGS_EX
 with_zlib
 with_system_tzdata
+with_libpmem
 with_libxslt
 XML2_LIBS
 XML2_CFLAGS
@@ -864,6 +865,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libpmem
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1567,6 +1569,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libpmem          use PMEM support for WAL I/O
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -8537,6 +8540,33 @@ fi
 
 
 
+#
+# PMEM
+#
+
+
+
+# Check whether --with-libpmem was given.
+if test "${with_libpmem+set}" = set; then :
+  withval=$with_libpmem;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBPMEM 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libpmem=no
+
+fi
 
 
 
@@ -12586,6 +12616,57 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'pmem' is required for PMEM support" "$LINENO" 5
+fi
+
+fi
+
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -13412,6 +13493,17 @@ else
 fi
 
 
+fi
+
+if test "$with_libpmem" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for PMEM support" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$with_ldap" = yes ; then
diff --git a/configure.ac b/configure.ac
index eb2c731b58..bc003624f0 100644
--- a/configure.ac
+++ b/configure.ac
@@ -980,6 +980,14 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 
 AC_SUBST(with_libxslt)
 
+#
+# PMEM
+#
+PGAC_ARG_BOOL(with, libpmem, no, [use PMEM support for WAL I/O],
+	      [AC_DEFINE([USE_LIBPMEM], 1, [Define to 1 to use PMEM support for WAL I/O. (--with-libpmem)])])
+
+AC_SUBST(with_libpmem)
+
 #
 # tzdata
 #
@@ -1242,6 +1250,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [], [AC_MSG_ERROR([library 'pmem' is required for PMEM support])])
+fi
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -1427,6 +1439,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for PMEM support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index fb270df678..1d4f6efe67 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -343,6 +343,9 @@
 /* Define to 1 if you have the `xslt' library (-lxslt). */
 #undef HAVE_LIBXSLT
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
@@ -881,6 +884,9 @@
 /* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
 #undef USE_LLVM
 
+/* Define to 1 to use PMEM support for WAL I/O. (--with-libpmem) */
+#undef USE_LIBPMEM
+
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
-- 
2.25.1
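
For readers unfamiliar with how a configure switch reaches the C sources: --with-libpmem defines USE_LIBPMEM in pg_config.h, and the code then selects an implementation with #ifdef. The following is a rough sketch of that gating, under stated assumptions. The sketch_ functions are hypothetical and are not part of the patch. Unlike the patch, whose non-PMEM stubs raise PANIC, this sketch falls back to memcpy() so that it can run on a build without PMDK. pmem_memcpy_persist() is a libpmem call and is compiled only when USE_LIBPMEM is defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#ifdef USE_LIBPMEM
#include <libpmem.h>
#endif

/* Report whether this translation unit was built with PMEM support. */
bool
sketch_pmem_compiled_in(void)
{
#ifdef USE_LIBPMEM
	return true;
#else
	return false;
#endif
}

/*
 * Copy 'len' bytes to 'dest'.  With USE_LIBPMEM the copy is also made
 * durable by libpmem; without it this is an ordinary memcpy (an assumption
 * of this sketch, not the behaviour of the patch's stubs).
 */
void
sketch_copy(void *dest, const void *src, size_t len)
{
	assert(dest != NULL && src != NULL);

#ifdef USE_LIBPMEM
	pmem_memcpy_persist(dest, src, len);
#else
	memcpy(dest, src, len);
#endif
}
```

Compiling with -DUSE_LIBPMEM and linking with -lpmem would select the first branch. The configure checks above exist to fail early when that library or its header is missing.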

v4-0003-Walreceiver-WAL-IO-using-PMDK.patch
From 6839ee902bcc4e725e2144b4aa2bcb125efd05eb Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 4 Aug 2020 13:03:02 +0900
Subject: [PATCH v4 3/3] Walreceiver WAL IO using PMDK

Author: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
---
 src/backend/replication/walreceiver.c | 62 ++++++++++++++++-----------
 1 file changed, 38 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index d5a9b568a6..b7fbd841ae 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -68,6 +68,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
 #include "storage/procsignal.h"
@@ -103,6 +104,8 @@ WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 static int	recvFile = -1;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
+static uint32 recvOff = 0;
+void	   *mappedFileAddr = NULL;
 
 /*
  * Flags set by interrupt handlers of walreceiver for later service in the
@@ -610,13 +613,13 @@ WalReceiverMain(void)
 		 * End of WAL reached on the requested timeline. Close the last
 		 * segment, and await for new orders from the startup process.
 		 */
-		if (recvFile >= 0)
+		if (recvFile >= 0 || mappedFileAddr != NULL)
 		{
 			char		xlogfname[MAXFNAMELEN];
 
 			XLogWalRcvFlush(false);
 			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			if (close(recvFile) != 0)
+			if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -632,6 +635,7 @@ WalReceiverMain(void)
 				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
+		mappedFileAddr = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -902,7 +906,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if ((recvFile < 0 && mappedFileAddr == NULL) ||
+			!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
 			bool		use_existent;
 
@@ -910,7 +915,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
 			 */
-			if (recvFile >= 0)
+			if (recvFile >= 0 || mappedFileAddr != NULL)
 			{
 				char		xlogfname[MAXFNAMELEN];
 
@@ -923,7 +928,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -939,11 +944,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
+			mappedFileAddr = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true, &mappedFileAddr);
 			recvFileTLI = ThisTimeLineID;
 		}
 
@@ -955,27 +961,35 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		else
 			segbytes = nbytes;
 
-		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
-		if (byteswritten <= 0)
+		if (mappedFileAddr)
+		{
+			PmemFileWrite((char *) mappedFileAddr + startoff, buf, segbytes);
+			byteswritten = segbytes;
+		}
+		else
 		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno;
+			/* OK to write the logs */
+			errno = 0;
+
+			byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+			if (byteswritten <= 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno;
 
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
 
-			save_errno = errno;
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			errno = save_errno;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							xlogfname, startoff, (unsigned long) segbytes)));
+				save_errno = errno;
+				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+				errno = save_errno;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+								"at offset %u, length %lu: %m",
+								xlogfname, startoff, (unsigned long) segbytes)));
+			}
 		}
 
 		/* Update state for write */
-- 
2.25.1
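
The XLogWalRcvWrite() change above follows the same pattern on the write side: clamp the chunk to the end of the current segment, then either copy into the mapping or call pg_pwrite(). A small sketch of that logic is given below as an illustration under assumptions. The sketch_ names and the tiny SKETCH_SEG_SIZE are hypothetical, and plain memcpy() stands in for PmemFileWrite() (pmem_memcpy_nodrain() in the patch), so nothing here is durable until a separate drain or sync step.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SKETCH_SEG_SIZE 64		/* hypothetical, tiny stand-in for wal_segment_size */

/* Limit a write so that it does not cross the end of the segment. */
size_t
sketch_clamp_to_segment(size_t startoff, size_t nbytes)
{
	assert(startoff < SKETCH_SEG_SIZE);

	if (startoff + nbytes > SKETCH_SEG_SIZE)
		return SKETCH_SEG_SIZE - startoff;
	return nbytes;
}

/*
 * Write up to 'nbytes' at 'startoff'.  If 'mapped' is non-NULL, copy into
 * the mapping; the copy cannot come up short, so the clamped length is
 * reported as written.  Otherwise use pwrite() on 'fd'.
 * Returns the number of bytes written, or -1 on error.
 */
ssize_t
sketch_write_chunk(int fd, void *mapped, size_t startoff,
				   const char *buf, size_t nbytes)
{
	size_t		segbytes = sketch_clamp_to_segment(startoff, nbytes);

	if (mapped != NULL)
	{
		memcpy((char *) mapped + startoff, buf, segbytes);
		return (ssize_t) segbytes;
	}

	return pwrite(fd, buf, segbytes, (off_t) startoff);
}
```

A caller would loop, advancing the buffer and offset by the returned count and opening the next segment once the current one is full, as the surrounding walreceiver loop does.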

#39Takashi Menjo
takashi.menjo@gmail.com
In reply to: Peter Eisentraut (#33)
3 attachment(s)
Re: [HACKERS][PATCH] Applying PMDK to WAL operations for persistent memory

Attachments:

v5-0001-Add-configure-option-for-PMDK.patch
From 66b102ae7a7e5fbb904e45e71ea8758c730b9b3d Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 4 Aug 2020 12:59:29 +0900
Subject: [PATCH v5 1/3] Add configure option for PMDK

Author: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
---
 configure                  | 92 ++++++++++++++++++++++++++++++++++++++
 configure.ac               | 16 +++++++
 src/include/pg_config.h.in |  6 +++
 3 files changed, 114 insertions(+)

diff --git a/configure b/configure
index 8af4b99021..2b54b8618d 100755
--- a/configure
+++ b/configure
@@ -700,6 +700,7 @@ LDFLAGS_SL
 LDFLAGS_EX
 with_zlib
 with_system_tzdata
+with_libpmem
 with_libxslt
 XML2_LIBS
 XML2_CFLAGS
@@ -864,6 +865,7 @@ with_uuid
 with_ossp_uuid
 with_libxml
 with_libxslt
+with_libpmem
 with_system_tzdata
 with_zlib
 with_gnu_ld
@@ -1567,6 +1569,7 @@ Optional Packages:
   --with-ossp-uuid        obsolete spelling of --with-uuid=ossp
   --with-libxml           build with XML support
   --with-libxslt          use XSLT support when building contrib/xml2
+  --with-libpmem          use PMEM support for WAL I/O
   --with-system-tzdata=DIR
                           use system time zone data in DIR
   --without-zlib          do not use Zlib
@@ -8543,6 +8546,33 @@ fi
 
 
 
+#
+# PMEM
+#
+
+
+
+# Check whether --with-libpmem was given.
+if test "${with_libpmem+set}" = set; then :
+  withval=$with_libpmem;
+  case $withval in
+    yes)
+
+$as_echo "#define USE_LIBPMEM 1" >>confdefs.h
+
+      ;;
+    no)
+      :
+      ;;
+    *)
+      as_fn_error $? "no argument expected for --with-libpmem option" "$LINENO" 5
+      ;;
+  esac
+
+else
+  with_libpmem=no
+
+fi
 
 
 
@@ -12592,6 +12622,57 @@ fi
 
 fi
 
+if test "$with_libpmem" = yes ; then
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking for pmem_map_file in -lpmem" >&5
+$as_echo_n "checking for pmem_map_file in -lpmem... " >&6; }
+if ${ac_cv_lib_pmem_pmem_map_file+:} false; then :
+  $as_echo_n "(cached) " >&6
+else
+  ac_check_lib_save_LIBS=$LIBS
+LIBS="-lpmem  $LIBS"
+cat confdefs.h - <<_ACEOF >conftest.$ac_ext
+/* end confdefs.h.  */
+
+/* Override any GCC internal prototype to avoid an error.
+   Use char because int might match the return type of a GCC
+   builtin and then its argument prototype would still apply.  */
+#ifdef __cplusplus
+extern "C"
+#endif
+char pmem_map_file ();
+int
+main ()
+{
+return pmem_map_file ();
+  ;
+  return 0;
+}
+_ACEOF
+if ac_fn_c_try_link "$LINENO"; then :
+  ac_cv_lib_pmem_pmem_map_file=yes
+else
+  ac_cv_lib_pmem_pmem_map_file=no
+fi
+rm -f core conftest.err conftest.$ac_objext \
+    conftest$ac_exeext conftest.$ac_ext
+LIBS=$ac_check_lib_save_LIBS
+fi
+{ $as_echo "$as_me:${as_lineno-$LINENO}: result: $ac_cv_lib_pmem_pmem_map_file" >&5
+$as_echo "$ac_cv_lib_pmem_pmem_map_file" >&6; }
+if test "x$ac_cv_lib_pmem_pmem_map_file" = xyes; then :
+  cat >>confdefs.h <<_ACEOF
+#define HAVE_LIBPMEM 1
+_ACEOF
+
+  LIBS="-lpmem $LIBS"
+
+else
+  as_fn_error $? "library 'pmem' is required for PMEM support" "$LINENO" 5
+fi
+
+fi
+
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -13418,6 +13499,17 @@ else
 fi
 
 
+fi
+
+if test "$with_libpmem" = yes ; then
+  ac_fn_c_check_header_mongrel "$LINENO" "libpmem.h" "ac_cv_header_libpmem_h" "$ac_includes_default"
+if test "x$ac_cv_header_libpmem_h" = xyes; then :
+
+else
+  as_fn_error $? "header file <libpmem.h> is required for PMEM support" "$LINENO" 5
+fi
+
+
 fi
 
 if test "$with_ldap" = yes ; then
diff --git a/configure.ac b/configure.ac
index 868a94c9ba..8e40247352 100644
--- a/configure.ac
+++ b/configure.ac
@@ -985,6 +985,14 @@ PGAC_ARG_BOOL(with, libxslt, no, [use XSLT support when building contrib/xml2],
 
 AC_SUBST(with_libxslt)
 
+#
+# PMEM
+#
+PGAC_ARG_BOOL(with, libpmem, no, [use PMEM support for WAL I/O],
+	      [AC_DEFINE([USE_LIBPMEM], 1, [Define to 1 to use PMEM support for WAL I/O. (--with-libpmem)])])
+
+AC_SUBST(with_libpmem)
+
 #
 # tzdata
 #
@@ -1247,6 +1255,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_LIB(xslt, xsltCleanupGlobals, [], [AC_MSG_ERROR([library 'xslt' is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_LIB(pmem, pmem_map_file, [], [AC_MSG_ERROR([library 'pmem' is required for PMEM support])])
+fi
+
 # Note: We can test for libldap_r only after we know PTHREAD_LIBS
 if test "$with_ldap" = yes ; then
   _LIBS="$LIBS"
@@ -1433,6 +1445,10 @@ if test "$with_libxslt" = yes ; then
   AC_CHECK_HEADER(libxslt/xslt.h, [], [AC_MSG_ERROR([header file <libxslt/xslt.h> is required for XSLT support])])
 fi
 
+if test "$with_libpmem" = yes ; then
+  AC_CHECK_HEADER(libpmem.h, [], [AC_MSG_ERROR([header file <libpmem.h> is required for PMEM support])])
+fi
+
 if test "$with_ldap" = yes ; then
   if test "$PORTNAME" != "win32"; then
      AC_CHECK_HEADERS(ldap.h, [],
diff --git a/src/include/pg_config.h.in b/src/include/pg_config.h.in
index f4d9f3b408..977746922a 100644
--- a/src/include/pg_config.h.in
+++ b/src/include/pg_config.h.in
@@ -343,6 +343,9 @@
 /* Define to 1 if you have the `xslt' library (-lxslt). */
 #undef HAVE_LIBXSLT
 
+/* Define to 1 if you have the `pmem' library (-lpmem). */
+#undef HAVE_LIBPMEM
+
 /* Define to 1 if you have the `z' library (-lz). */
 #undef HAVE_LIBZ
 
@@ -896,6 +899,9 @@
 /* Define to 1 to build with LLVM based JIT support. (--with-llvm) */
 #undef USE_LLVM
 
+/* Define to 1 to use PMEM support for WAL I/O. (--with-libpmem) */
+#undef USE_LIBPMEM
+
 /* Define to select named POSIX semaphores. */
 #undef USE_NAMED_POSIX_SEMAPHORES
 
-- 
2.25.1
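
For readers following the series: the configure change above only defines USE_LIBPMEM (and HAVE_LIBPMEM); the WAL patch then uses it to compile the "pmem_drain" entry into the sync method table. The following is a minimal sketch of that gating pattern, not code from the patch. The enum, the struct and lookup_sync_option() are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical constants; the patch uses SYNC_METHOD_* values instead. */
enum sync_method
{
	SYNC_FSYNC,
	SYNC_FDATASYNC,
	SYNC_PMEM_DRAIN
};

/* Hypothetical, simplified stand-in for struct config_enum_entry. */
struct option_entry
{
	const char *name;
	int			value;
};

static const struct option_entry sync_options[] = {
	{"fsync", SYNC_FSYNC},
	{"fdatasync", SYNC_FDATASYNC},
#ifdef USE_LIBPMEM
	/* Only offered when configure was run with --with-libpmem. */
	{"pmem_drain", SYNC_PMEM_DRAIN},
#endif
	{NULL, 0}
};

/* Return the value for a method name, or -1 if it was not compiled in. */
static int
lookup_sync_option(const char *name)
{
	assert(name != NULL);
	for (const struct option_entry *e = sync_options; e->name != NULL; e++)
	{
		if (strcmp(e->name, name) == 0)
			return e->value;
	}
	return -1;
}
```

Built without -DUSE_LIBPMEM, lookup_sync_option("pmem_drain") returns -1, which mirrors how a server built without --with-libpmem would reject that setting.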

v5-0003-Walreceiver-WAL-IO-using-PMDK.patch (application/octet-stream)
From 7fc63665c3975ac484f82f3a4f8bb0febd7df487 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 4 Aug 2020 13:03:02 +0900
Subject: [PATCH v5 3/3] Walreceiver WAL IO using PMDK

Author: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
---
 src/backend/replication/walreceiver.c | 61 ++++++++++++++++-----------
 1 file changed, 37 insertions(+), 24 deletions(-)

diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 723f513d8b..560bcd7301 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -68,6 +68,7 @@
 #include "replication/walreceiver.h"
 #include "replication/walsender.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/procarray.h"
 #include "storage/procsignal.h"
@@ -103,6 +104,7 @@ WalReceiverFunctionsType *WalReceiverFunctions = NULL;
 static int	recvFile = -1;
 static TimeLineID recvFileTLI = 0;
 static XLogSegNo recvSegNo = 0;
+static void *mappedFileAddr = NULL;
 
 /*
  * LogstreamResult indicates the byte positions that we have already
@@ -596,13 +598,13 @@ WalReceiverMain(void)
 		 * End of WAL reached on the requested timeline. Close the last
 		 * segment, and await for new orders from the startup process.
 		 */
-		if (recvFile >= 0)
+		if (recvFile >= 0 || mappedFileAddr != NULL)
 		{
 			char		xlogfname[MAXFNAMELEN];
 
 			XLogWalRcvFlush(false);
 			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			if (close(recvFile) != 0)
+			if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 				ereport(PANIC,
 						(errcode_for_file_access(),
 						 errmsg("could not close log segment %s: %m",
@@ -618,6 +620,7 @@ WalReceiverMain(void)
 				XLogArchiveNotify(xlogfname);
 		}
 		recvFile = -1;
+		mappedFileAddr = NULL;
 
 		elog(DEBUG1, "walreceiver ended streaming and awaits new instructions");
 		WalRcvWaitForStartPosition(&startpoint, &startpointTLI);
@@ -875,7 +878,8 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 	{
 		int			segbytes;
 
-		if (recvFile < 0 || !XLByteInSeg(recptr, recvSegNo, wal_segment_size))
+		if ((recvFile < 0 && mappedFileAddr == NULL) ||
+			!XLByteInSeg(recptr, recvSegNo, wal_segment_size))
 		{
 			bool		use_existent;
 
@@ -883,7 +887,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 			 * fsync() and close current file before we switch to next one. We
 			 * would otherwise have to reopen this file to fsync it later
 			 */
-			if (recvFile >= 0)
+			if (recvFile >= 0 || mappedFileAddr != NULL)
 			{
 				char		xlogfname[MAXFNAMELEN];
 
@@ -896,7 +900,7 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 				 * process soon, so we don't advise the OS to release cache
 				 * pages associated with the file like XLogFileClose() does.
 				 */
-				if (close(recvFile) != 0)
+				if (do_XLogFileClose(recvFile, mappedFileAddr) != 0)
 					ereport(PANIC,
 							(errcode_for_file_access(),
 							 errmsg("could not close log segment %s: %m",
@@ -912,11 +916,12 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 					XLogArchiveNotify(xlogfname);
 			}
 			recvFile = -1;
+			mappedFileAddr = NULL;
 
 			/* Create/use new log file */
 			XLByteToSeg(recptr, recvSegNo, wal_segment_size);
 			use_existent = true;
-			recvFile = XLogFileInit(recvSegNo, &use_existent, true);
+			recvFile = XLogFileInit(recvSegNo, &use_existent, true, &mappedFileAddr);
 			recvFileTLI = ThisTimeLineID;
 		}
 
@@ -928,27 +933,35 @@ XLogWalRcvWrite(char *buf, Size nbytes, XLogRecPtr recptr)
 		else
 			segbytes = nbytes;
 
-		/* OK to write the logs */
-		errno = 0;
-
-		byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
-		if (byteswritten <= 0)
+		if (mappedFileAddr)
+		{
+			PmemFileWrite((char *) mappedFileAddr + startoff, buf, segbytes);
+			byteswritten = segbytes;
+		}
+		else
 		{
-			char		xlogfname[MAXFNAMELEN];
-			int			save_errno;
+			/* OK to write the logs */
+			errno = 0;
+
+			byteswritten = pg_pwrite(recvFile, buf, segbytes, (off_t) startoff);
+			if (byteswritten <= 0)
+			{
+				char		xlogfname[MAXFNAMELEN];
+				int			save_errno;
 
-			/* if write didn't set errno, assume no disk space */
-			if (errno == 0)
-				errno = ENOSPC;
+				/* if write didn't set errno, assume no disk space */
+				if (errno == 0)
+					errno = ENOSPC;
 
-			save_errno = errno;
-			XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
-			errno = save_errno;
-			ereport(PANIC,
-					(errcode_for_file_access(),
-					 errmsg("could not write to log segment %s "
-							"at offset %u, length %lu: %m",
-							xlogfname, startoff, (unsigned long) segbytes)));
+				save_errno = errno;
+				XLogFileName(xlogfname, recvFileTLI, recvSegNo, wal_segment_size);
+				errno = save_errno;
+				ereport(PANIC,
+						(errcode_for_file_access(),
+						 errmsg("could not write to log segment %s "
+								"at offset %u, length %lu: %m",
+								xlogfname, startoff, (unsigned long) segbytes)));
+			}
 		}
 
 		/* Update state for write */
-- 
2.25.1
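
The walreceiver change above boils down to one dispatch: if the segment is mapped, copy into the mapping at the segment offset; otherwise fall back to pwrite() on the descriptor. Below is a minimal sketch of that pattern using plain POSIX mmap() and memcpy() as stand-ins. It does not use libpmem, and map_segment() and segment_write() are hypothetical helpers, not functions from the patch. The memcpy() stands in for the patch's PmemFileWrite(), whose implementation lives in pmem.c and is not reproduced here.

```c
#define _XOPEN_SOURCE 700		/* for pwrite(), pread(), mkstemp() */

#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: size the file and map it shared; NULL on failure. */
static void *
map_segment(int fd, size_t size)
{
	void	   *addr;

	if (ftruncate(fd, (off_t) size) != 0)
		return NULL;
	addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	return (addr == MAP_FAILED) ? NULL : addr;
}

/*
 * Hypothetical helper: write len bytes at offset off, through the mapping
 * when one exists, else through the file descriptor. Returns the number of
 * bytes written, or -1 with errno set.
 */
static ssize_t
segment_write(int fd, void *mapped, const char *buf, size_t len, off_t off)
{
	size_t		done = 0;

	if (mapped != NULL)
	{
		/* Mapped path: a store into memory, no system call involved. */
		memcpy((char *) mapped + off, buf, len);
		return (ssize_t) len;
	}

	assert(fd >= 0);
	while (done < len)
	{
		ssize_t		n = pwrite(fd, buf + done, len - done, off + (off_t) done);

		if (n < 0 && errno == EINTR)
			continue;			/* interrupted, retry */
		if (n <= 0)
		{
			if (n == 0)
				errno = ENOSPC; /* no progress: assume no space left */
			return -1;
		}
		done += (size_t) n;
	}
	return (ssize_t) done;
}
```

As in the patch, the mapped branch cannot report a short write, so the caller treats it as having written the full length. Durability is a separate step: in this sketch it would be msync() on the mapping, while the patch calls PmemFileSync() from issue_xlog_fsync() when wal_sync_method is pmem_drain.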

v5-0002-Read-write-WAL-files-using-PMDK.patch (application/octet-stream)
From 9b2e309113d97ce5982b581c9cd86fc9001b2952 Mon Sep 17 00:00:00 2001
From: Takashi Menjo <takashi.menjou.vg@hco.ntt.co.jp>
Date: Tue, 4 Aug 2020 13:02:14 +0900
Subject: [PATCH v5 2/3] Read write WAL files using PMDK

Author: Yoshimi Ichiyanagi <ichiyanagi.yoshimi@lab.ntt.co.jp>
---
 src/backend/access/transam/xlog.c             | 479 ++++++++++++------
 src/backend/storage/file/Makefile             |   3 +-
 src/backend/storage/file/fd.c                 | 121 +++++
 src/backend/storage/file/pmem.c               | 188 +++++++
 src/backend/utils/misc/guc.c                  |   2 +-
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/access/xlog.h                     |   8 +-
 src/include/storage/fd.h                      |  13 +
 src/include/storage/pmem.h                    |  32 ++
 src/include/utils/guc.h                       |   1 +
 10 files changed, 696 insertions(+), 152 deletions(-)
 create mode 100644 src/backend/storage/file/pmem.c
 create mode 100644 src/include/storage/pmem.h

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 470e113b33..d77725059d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -64,6 +64,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/pmem.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -150,6 +151,9 @@ const struct config_enum_entry sync_method_options[] = {
 #endif
 #ifdef OPEN_DATASYNC_FLAG
 	{"open_datasync", SYNC_METHOD_OPEN_DSYNC, false},
+#endif
+#ifdef USE_LIBPMEM
+	{"pmem_drain", SYNC_METHOD_PMEM_DRAIN, false},
 #endif
 	{NULL, 0, false}
 };
@@ -808,6 +812,7 @@ static const char *const xlogSourceNames[] = {"any", "archive", "pg_wal", "strea
  */
 static int	openLogFile = -1;
 static XLogSegNo openLogSegNo = 0;
+static void *mappedLogFileAddr = NULL;
 
 /*
  * These variables are used similarly to the ones above, but for reading
@@ -825,6 +830,7 @@ static XLogSegNo readSegNo = 0;
 static uint32 readOff = 0;
 static uint32 readLen = 0;
 static XLogSource readSource = XLOG_FROM_ANY;
+static void *mappedReadFileAddr = NULL;
 
 /*
  * Keeps track of which source we're currently reading from. This is
@@ -914,13 +920,15 @@ static XLogRecPtr XLogGetReplicationSlotMinimumLSN(void);
 
 static void AdvanceXLInsertBuffer(XLogRecPtr upto, bool opportunistic);
 static bool XLogCheckpointNeeded(XLogSegNo new_segno);
+static int	do_XLogFileOpen(char *pathname, int flags, void **addr);
 static void XLogWrite(XLogwrtRqst WriteRqst, bool flexible);
 static bool InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
 								   bool find_free, XLogSegNo max_segno,
 								   bool use_lock);
 static int	XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-						 XLogSource source, bool notfoundOk);
-static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source);
+						 XLogSource source, bool notfoundOk, void **addr);
+static int	XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source,
+							   void **addr);
 static int	XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr,
 						 int reqLen, XLogRecPtr targetRecPtr, char *readBuf);
 static bool WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
@@ -2412,6 +2420,15 @@ XLogCheckpointNeeded(XLogSegNo new_segno)
 	return false;
 }
 
+static int
+do_XLogFileOpen(char *pathname, int flags, void **addr)
+{
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		return PmemFileOpen(pathname, flags, wal_segment_size, addr);
+	else
+		return BasicOpenFile(pathname, flags);
+}
+
 /*
  * Write and/or fsync the log at least as far as WriteRqst indicates.
  *
@@ -2491,24 +2508,27 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			 * pages here (since we dump what we have at segment end).
 			 */
 			Assert(npages == 0);
-			if (openLogFile >= 0)
+			if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 				XLogFileClose();
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
 
 			/* create/use new log file */
 			use_existent = true;
-			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true);
-			ReserveExternalFD();
+			openLogFile = XLogFileInit(openLogSegNo, &use_existent, true,
+									   &mappedLogFileAddr);
+			if (openLogFile >= 0)
+				ReserveExternalFD();
 		}
 
 		/* Make sure we have the current logfile open */
-		if (openLogFile < 0)
+		if (openLogFile < 0 && mappedLogFileAddr == NULL)
 		{
 			XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 							wal_segment_size);
-			openLogFile = XLogFileOpen(openLogSegNo);
-			ReserveExternalFD();
+			openLogFile = XLogFileOpen(openLogSegNo, &mappedLogFileAddr);
+			if (openLogFile >= 0)
+				ReserveExternalFD();
 		}
 
 		/* Add current page to the set of pending pages-to-dump */
@@ -2544,35 +2564,49 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 			/* OK to write the page(s) */
 			from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
 			nbytes = npages * (Size) XLOG_BLCKSZ;
-			nleft = nbytes;
-			do
+
+			if (mappedLogFileAddr != NULL)
 			{
-				errno = 0;
 				pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
-				written = pg_pwrite(openLogFile, from, nleft, startoffset);
+				PmemFileWrite((char *) mappedLogFileAddr + startoffset, from, nbytes);
 				pgstat_report_wait_end();
-				if (written <= 0)
+
+				written = nbytes;
+				nleft = 0;
+				from += nbytes;
+			}
+			else
+			{
+				nleft = nbytes;
+				do
 				{
-					char		xlogfname[MAXFNAMELEN];
-					int			save_errno;
+					errno = 0;
+					pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
+					written = pg_pwrite(openLogFile, from, nleft, startoffset);
+					pgstat_report_wait_end();
+					if (written <= 0)
+					{
+						char		xlogfname[MAXFNAMELEN];
+						int			save_errno;
 
-					if (errno == EINTR)
-						continue;
+						if (errno == EINTR)
+							continue;
 
-					save_errno = errno;
-					XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
-								 wal_segment_size);
-					errno = save_errno;
-					ereport(PANIC,
-							(errcode_for_file_access(),
-							 errmsg("could not write to log file %s "
-									"at offset %u, length %zu: %m",
-									xlogfname, startoffset, nleft)));
-				}
-				nleft -= written;
-				from += written;
-				startoffset += written;
-			} while (nleft > 0);
+						save_errno = errno;
+						XLogFileName(xlogfname, ThisTimeLineID, openLogSegNo,
+									 wal_segment_size);
+						errno = save_errno;
+						ereport(PANIC,
+								(errcode_for_file_access(),
+								 errmsg("could not write to log file %s "
+										"at offset %u, length %zu: %m",
+										xlogfname, startoffset, nleft)));
+					}
+					nleft -= written;
+					from += written;
+					startoffset += written;
+				} while (nleft > 0);
+			}
 
 			npages = 0;
 
@@ -2650,16 +2684,17 @@ XLogWrite(XLogwrtRqst WriteRqst, bool flexible)
 		if (sync_method != SYNC_METHOD_OPEN &&
 			sync_method != SYNC_METHOD_OPEN_DSYNC)
 		{
-			if (openLogFile >= 0 &&
+			if ((openLogFile >= 0 || mappedLogFileAddr != NULL) &&
 				!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
 				XLogFileClose();
-			if (openLogFile < 0)
+			if (openLogFile < 0 && mappedLogFileAddr == NULL)
 			{
 				XLByteToPrevSeg(LogwrtResult.Write, openLogSegNo,
 								wal_segment_size);
-				openLogFile = XLogFileOpen(openLogSegNo);
-				ReserveExternalFD();
+				openLogFile = XLogFileOpen(openLogSegNo, &mappedLogFileAddr);
+				if (openLogFile >= 0)
+					ReserveExternalFD();
 			}
 
 			issue_xlog_fsync(openLogFile, openLogSegNo);
@@ -3083,7 +3118,7 @@ XLogBackgroundFlush(void)
 	 */
 	if (WriteRqst.Write <= LogwrtResult.Flush)
 	{
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			if (!XLByteInPrevSeg(LogwrtResult.Write, openLogSegNo,
 								 wal_segment_size))
@@ -3264,7 +3299,8 @@ XLogNeedsFlush(XLogRecPtr record)
  * in a critical section.
  */
 int
-XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
+XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+			 void **addr)
 {
 	char		path[MAXPGPATH];
 	char		tmppath[MAXPGPATH];
@@ -3273,6 +3309,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	XLogSegNo	max_segno;
 	int			fd;
 	int			save_errno;
+	void	   *tmpaddr = NULL;
 
 	XLogFilePath(path, ThisTimeLineID, logsegno, wal_segment_size);
 
@@ -3281,8 +3318,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	 */
 	if (*use_existent)
 	{
-		fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-		if (fd < 0)
+		fd = do_XLogFileOpen(path,
+							 O_RDWR | PG_BINARY | get_sync_bit(sync_method),
+							 &tmpaddr);
+		if (fd < 0 && tmpaddr == NULL)
 		{
 			if (errno != ENOENT)
 				ereport(ERROR,
@@ -3290,7 +3329,10 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 						 errmsg("could not open file \"%s\": %m", path)));
 		}
 		else
+		{
+			*addr = tmpaddr;
 			return fd;
+		}
 	}
 
 	/*
@@ -3306,8 +3348,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = BasicOpenFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	fd = do_XLogFileOpen(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+						 &tmpaddr);
+	if (fd < 0 && tmpaddr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3318,9 +3361,6 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	save_errno = 0;
 	if (wal_init_zero)
 	{
-		struct iovec iov[PG_IOV_MAX];
-		int			blocks;
-
 		/*
 		 * Zero-fill the file.  With this setting, we do this the hard way to
 		 * ensure that all the file space has really been allocated.  On
@@ -3330,28 +3370,41 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 * indirect blocks are down on disk.  Therefore, fdatasync(2) or
 		 * O_DSYNC will be sufficient to sync future writes to the log file.
 		 */
-
-		/* Prepare to write out a lot of copies of our zero buffer at once. */
-		for (int i = 0; i < lengthof(iov); ++i)
+		if (tmpaddr != NULL)
 		{
-			iov[i].iov_base = zbuffer.data;
-			iov[i].iov_len = XLOG_BLCKSZ;
+			for (int i = 0; i < wal_segment_size / XLOG_BLCKSZ; ++i)
+			{
+				PmemFileWrite((char *) tmpaddr + i * XLOG_BLCKSZ, zbuffer.data,
+							  XLOG_BLCKSZ);
+			}
 		}
-
-		/* Loop, writing as many blocks as we can for each system call. */
-		blocks = wal_segment_size / XLOG_BLCKSZ;
-		for (int i = 0; i < blocks;)
+		else
 		{
-			int 		iovcnt = Min(blocks - i, lengthof(iov));
-			off_t		offset = i * XLOG_BLCKSZ;
+			struct iovec iov[PG_IOV_MAX];
+			int			blocks;
 
-			if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
+			/* Prepare to write out a lot of copies of our zero buffer at once. */
+			for (int i = 0; i < lengthof(iov); ++i)
 			{
-				save_errno = errno;
-				break;
+				iov[i].iov_base = zbuffer.data;
+				iov[i].iov_len = XLOG_BLCKSZ;
 			}
 
-			i += iovcnt;
+			/* Loop, writing as many blocks as we can for each system call. */
+			blocks = wal_segment_size / XLOG_BLCKSZ;
+			for (int i = 0; i < blocks;)
+			{
+				int 		iovcnt = Min(blocks - i, lengthof(iov));
+				off_t		offset = i * XLOG_BLCKSZ;
+
+				if (pg_pwritev_with_retry(fd, iov, iovcnt, offset) < 0)
+				{
+					save_errno = errno;
+					break;
+				}
+
+				i += iovcnt;
+			}
 		}
 	}
 	else
@@ -3360,11 +3413,17 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 		 * Otherwise, seeking to the end and writing a solitary byte is
 		 * enough.
 		 */
-		errno = 0;
-		if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+		if (tmpaddr != NULL)
+			PmemFileWrite((char *) tmpaddr + wal_segment_size - 1,
+						  zbuffer.data, 1);
+		else
 		{
-			/* if write didn't set errno, assume no disk space */
-			save_errno = errno ? errno : ENOSPC;
+			errno = 0;
+			if (pg_pwrite(fd, zbuffer.data, 1, wal_segment_size - 1) != 1)
+			{
+				/* if write didn't set errno, assume no disk space */
+				save_errno = errno ? errno : ENOSPC;
+			}
 		}
 	}
 	pgstat_report_wait_end();
@@ -3386,11 +3445,11 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_INIT_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, tmpaddr) != 0)
 	{
 		int			save_errno = errno;
 
-		close(fd);
+		do_XLogFileClose(fd, tmpaddr);
 		errno = save_errno;
 		ereport(ERROR,
 				(errcode_for_file_access(),
@@ -3398,7 +3457,7 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	}
 	pgstat_report_wait_end();
 
-	if (close(fd) != 0)
+	if (do_XLogFileClose(fd, tmpaddr))
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3439,8 +3498,9 @@ XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock)
 	*use_existent = false;
 
 	/* Now open original target segment (might not be file I just made) */
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3475,13 +3535,20 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	int			srcfd;
 	int			fd;
 	int			nbytes;
+	void	   *src_addr = NULL;
+	void	   *dst_addr = NULL;
 
 	/*
 	 * Open the source file
 	 */
 	XLogFilePath(path, srcTLI, srcsegno, wal_segment_size);
-	srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
-	if (srcfd < 0)
+	srcfd = -1;
+	if (sync_method == SYNC_METHOD_PMEM_DRAIN)
+		srcfd = MapTransientFile(path, O_RDONLY | PG_BINARY,
+								 wal_segment_size, &src_addr);
+	if (src_addr == NULL)
+		srcfd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (srcfd < 0 && src_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3494,8 +3561,15 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	unlink(tmppath);
 
 	/* do not use get_sync_bit() here --- want to fsync only at end of fill */
-	fd = OpenTransientFile(tmppath, O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
-	if (fd < 0)
+	if (src_addr != NULL && sync_method == SYNC_METHOD_PMEM_DRAIN)
+		fd = MapTransientFile(tmppath,
+							  O_RDWR | O_CREAT | O_EXCL | PG_BINARY,
+							  wal_segment_size, &dst_addr);
+	else
+		fd = OpenTransientFile(tmppath,
+							   O_RDWR | O_CREAT | O_EXCL | PG_BINARY);
+
+	if (fd < 0 && dst_addr == NULL)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not create file \"%s\": %m", tmppath)));
@@ -3503,6 +3577,15 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 	/*
 	 * Do the data copying.
 	 */
+	if (src_addr && dst_addr)
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_READ);
+		PmemFileWrite(dst_addr, src_addr, wal_segment_size);
+		pgstat_report_wait_end();
+
+		goto done_copy;
+	}
+
 	for (nbytes = 0; nbytes < wal_segment_size; nbytes += sizeof(buffer))
 	{
 		int			nread;
@@ -3559,14 +3642,22 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 		pgstat_report_wait_end();
 	}
 
+done_copy:
 	pgstat_report_wait_start(WAIT_EVENT_WAL_COPY_SYNC);
-	if (pg_fsync(fd) != 0)
+	if (xlog_fsync(fd, dst_addr) != 0)
 		ereport(data_sync_elevel(ERROR),
 				(errcode_for_file_access(),
 				 errmsg("could not fsync file \"%s\": %m", tmppath)));
 	pgstat_report_wait_end();
 
-	if (CloseTransientFile(fd) != 0)
+	if (dst_addr)
+	{
+		if (UnmapTransientFile(dst_addr, wal_segment_size))
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not unmap file \"%s\": %m", tmppath)));
+	}
+	else if (CloseTransientFile(fd) != 0)
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", tmppath)));
@@ -3575,6 +3666,13 @@ XLogFileCopy(XLogSegNo destsegno, TimeLineID srcTLI, XLogSegNo srcsegno,
 		ereport(ERROR,
 				(errcode_for_file_access(),
 				 errmsg("could not close file \"%s\": %m", path)));
+	if (src_addr)
+		UnmapTransientFile(src_addr, wal_segment_size);
+	else
+		if (CloseTransientFile(srcfd) != 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not close file \"%s\": %m", path)));
 
 	/*
 	 * Now move the segment into place with its final name.
@@ -3671,15 +3769,16 @@ InstallXLogFileSegment(XLogSegNo *segno, char *tmppath,
  * Open a pre-existing logfile segment for writing.
  */
 int
-XLogFileOpen(XLogSegNo segno)
+XLogFileOpen(XLogSegNo segno, void **addr)
 {
 	char		path[MAXPGPATH];
 	int			fd;
 
 	XLogFilePath(path, ThisTimeLineID, segno, wal_segment_size);
 
-	fd = BasicOpenFile(path, O_RDWR | PG_BINARY | get_sync_bit(sync_method));
-	if (fd < 0)
+	fd = do_XLogFileOpen(path,
+						 O_RDWR | PG_BINARY | get_sync_bit(sync_method), addr);
+	if (fd < 0 && *addr == NULL)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not open file \"%s\": %m", path)));
@@ -3695,7 +3794,7 @@ XLogFileOpen(XLogSegNo segno)
  */
 static int
 XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
-			 XLogSource source, bool notfoundOk)
+			 XLogSource source, bool notfoundOk, void **addr)
 {
 	char		xlogfname[MAXFNAMELEN];
 	char		activitymsg[MAXFNAMELEN + 16];
@@ -3744,8 +3843,8 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
 		snprintf(path, MAXPGPATH, XLOGDIR "/%s", xlogfname);
 	}
 
-	fd = BasicOpenFile(path, O_RDONLY | PG_BINARY);
-	if (fd >= 0)
+	fd = do_XLogFileOpen(path, O_RDONLY | PG_BINARY, addr);
+	if (fd >= 0 || *addr != NULL)
 	{
 		/* Success! */
 		curFileTLI = tli;
@@ -3777,7 +3876,7 @@ XLogFileRead(XLogSegNo segno, int emode, TimeLineID tli,
  * This version searches for the segment with any TLI listed in expectedTLEs.
  */
 static int
-XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
+XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source, void **addr)
 {
 	char		path[MAXPGPATH];
 	ListCell   *cell;
@@ -3842,8 +3941,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_ARCHIVE)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_ARCHIVE, true);
-			if (fd != -1)
+							  XLOG_FROM_ARCHIVE, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				elog(DEBUG1, "got WAL segment from archive");
 				if (!expectedTLEs)
@@ -3855,8 +3954,8 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 		if (source == XLOG_FROM_ANY || source == XLOG_FROM_PG_WAL)
 		{
 			fd = XLogFileRead(segno, emode, tli,
-							  XLOG_FROM_PG_WAL, true);
-			if (fd != -1)
+							  XLOG_FROM_PG_WAL, true, addr);
+			if (fd != -1 || *addr != NULL)
 			{
 				if (!expectedTLEs)
 					expectedTLEs = tles;
@@ -3874,13 +3973,22 @@ XLogFileReadAnyTLI(XLogSegNo segno, int emode, XLogSource source)
 	return -1;
 }
 
+int
+do_XLogFileClose(int fd, void *addr)
+{
+	if (!addr)
+		return close(fd);
+
+	return PmemFileClose(addr, wal_segment_size);
+}
+
 /*
  * Close the current logfile segment for writing.
  */
 static void
 XLogFileClose(void)
 {
-	Assert(openLogFile >= 0);
+	Assert(openLogFile >= 0 || mappedLogFileAddr != NULL);
 
 	/*
 	 * WAL segment files will not be re-read in normal operation, so we advise
@@ -3889,11 +3997,11 @@ XLogFileClose(void)
 	 * use the cache to read the WAL segment.
 	 */
 #if defined(USE_POSIX_FADVISE) && defined(POSIX_FADV_DONTNEED)
-	if (!XLogIsNeeded())
+	if (!XLogIsNeeded() && openLogFile >= 0)
 		(void) posix_fadvise(openLogFile, 0, 0, POSIX_FADV_DONTNEED);
 #endif
 
-	if (close(openLogFile) != 0)
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 	{
 		char		xlogfname[MAXFNAMELEN];
 		int			save_errno = errno;
@@ -3905,8 +4013,12 @@ XLogFileClose(void)
 				 errmsg("could not close file \"%s\": %m", xlogfname)));
 	}
 
-	openLogFile = -1;
-	ReleaseExternalFD();
+	mappedLogFileAddr = NULL;
+	if (openLogFile >= 0)
+	{
+		openLogFile = -1;
+		ReleaseExternalFD();
+	}
 }
 
 /*
@@ -3925,6 +4037,7 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	XLogSegNo	_logSegNo;
 	int			lf;
 	bool		use_existent;
+	void	   *laddr = NULL;
 	uint64		offset;
 
 	XLByteToPrevSeg(endptr, _logSegNo, wal_segment_size);
@@ -3933,8 +4046,8 @@ PreallocXlogFiles(XLogRecPtr endptr)
 	{
 		_logSegNo++;
 		use_existent = true;
-		lf = XLogFileInit(_logSegNo, &use_existent, true);
-		close(lf);
+		lf = XLogFileInit(_logSegNo, &use_existent, true, &laddr);
+		do_XLogFileClose(lf, laddr);
 		if (!use_existent)
 			CheckpointStats.ckpt_segs_added++;
 	}
@@ -4377,9 +4490,10 @@ ReadRecord(XLogReaderState *xlogreader, int emode,
 		EndRecPtr = xlogreader->EndRecPtr;
 		if (record == NULL)
 		{
-			if (readFile >= 0)
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
 			{
-				close(readFile);
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+				mappedReadFileAddr = NULL;
 				readFile = -1;
 			}
 
@@ -5327,7 +5441,7 @@ BootStrapXLOG(void)
 
 	/* Create first XLOG segment file */
 	use_existent = false;
-	openLogFile = XLogFileInit(1, &use_existent, false);
+	openLogFile = XLogFileInit(1, &use_existent, false, &mappedLogFileAddr);
 
 	/*
 	 * We needn't bother with Reserve/ReleaseExternalFD here, since we'll
@@ -5336,30 +5450,39 @@ BootStrapXLOG(void)
 
 	/* Write the first page with the initial record */
 	errno = 0;
-	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
-	if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+	if (mappedLogFileAddr != NULL)
 	{
-		/* if write didn't set errno, assume problem is no disk space */
-		if (errno == 0)
-			errno = ENOSPC;
-		ereport(PANIC,
-				(errcode_for_file_access(),
-				 errmsg("could not write bootstrap write-ahead log file: %m")));
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		PmemFileWrite(mappedLogFileAddr, page, XLOG_BLCKSZ);
+	}
+	else
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_WRITE);
+		if (write(openLogFile, page, XLOG_BLCKSZ) != XLOG_BLCKSZ)
+		{
+			/* if write didn't set errno, assume problem is no disk space */
+			if (errno == 0)
+				errno = ENOSPC;
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not write bootstrap write-ahead log file: %m")));
+		}
 	}
 	pgstat_report_wait_end();
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_BOOTSTRAP_SYNC);
-	if (pg_fsync(openLogFile) != 0)
+	if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not fsync bootstrap write-ahead log file: %m")));
 	pgstat_report_wait_end();
 
-	if (close(openLogFile) != 0)
+	if (do_XLogFileClose(openLogFile, mappedLogFileAddr))
 		ereport(PANIC,
 				(errcode_for_file_access(),
 				 errmsg("could not close bootstrap write-ahead log file: %m")));
 
+	mappedLogFileAddr = NULL;
 	openLogFile = -1;
 
 	/* Now create pg_control */
@@ -5594,9 +5717,10 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 	 * If the ending log segment is still open, close it (to avoid problems on
 	 * Windows with trying to rename or delete an open file).
 	 */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 
@@ -5635,10 +5759,11 @@ exitArchiveRecovery(TimeLineID endTLI, XLogRecPtr endOfLog)
 		 */
 		bool		use_existent = true;
 		int			fd;
+		void	   *tmpaddr = NULL;
 
-		fd = XLogFileInit(startLogSegNo, &use_existent, true);
+		fd = XLogFileInit(startLogSegNo, &use_existent, true, &tmpaddr);
 
-		if (close(fd) != 0)
+		if (do_XLogFileClose(fd, tmpaddr))
 		{
 			char		xlogfname[MAXFNAMELEN];
 			int			save_errno = errno;
@@ -7975,9 +8100,10 @@ StartupXLOG(void)
 		ShutdownRecoveryTransactionEnvironment();
 
 	/* Shut down xlogreader */
-	if (readFile >= 0)
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
 	{
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 	}
 	XLogReaderFree(xlogreader);
@@ -10501,6 +10627,9 @@ get_sync_bit(int method)
 		case SYNC_METHOD_FSYNC:
 		case SYNC_METHOD_FSYNC_WRITETHROUGH:
 		case SYNC_METHOD_FDATASYNC:
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+#endif
 			return 0;
 #ifdef OPEN_SYNC_FLAG
 		case SYNC_METHOD_OPEN:
@@ -10518,7 +10647,36 @@ get_sync_bit(int method)
 }
 
 /*
- * GUC support
+ * GUC check_hook for xlog_sync_method
+ */
+bool
+check_xlog_sync_method(int *newval, void **extra, GucSource source)
+{
+	bool		ret;
+	char		tmppath[MAXPGPATH] = {0};
+	int			val = newval ? *newval : sync_method;
+
+	if (val != SYNC_METHOD_PMEM_DRAIN)
+		return true;
+
+	snprintf(tmppath, MAXPGPATH, "%s/" XLOGDIR "/pmem.tmp.%d", DataDir, (int) getpid());
+
+	ret = CheckPmem(tmppath);
+
+	if (!ret)
+	{
+		GUC_check_errcode(ERRCODE_INVALID_PARAMETER_VALUE);
+		GUC_check_errmsg("invalid value for parameter \"wal_sync_method\": \"pmem_drain\"");
+		GUC_check_errdetail("%s is not stored on persistent memory (pmem_is_pmem() returned false).",
+							XLOGDIR);
+		GUC_check_errhint("See the ENVIRONMENT VARIABLES section of the libpmem man page.");
+	}
+
+	return ret;
+}
+
+/*
+ * GUC assign_hook for xlog_sync_method
  */
 void
 assign_xlog_sync_method(int new_sync_method, void *extra)
@@ -10531,10 +10689,10 @@ assign_xlog_sync_method(int new_sync_method, void *extra)
 		 * changing, close the log file so it will be reopened (with new flag
 		 * bit) at next use.
 		 */
-		if (openLogFile >= 0)
+		if (openLogFile >= 0 || mappedLogFileAddr != NULL)
 		{
 			pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC_METHOD_ASSIGN);
-			if (pg_fsync(openLogFile) != 0)
+			if (xlog_fsync(openLogFile, (void *) mappedLogFileAddr) != 0)
 			{
 				char		xlogfname[MAXFNAMELEN];
 				int			save_errno;
@@ -10585,6 +10743,11 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 			if (pg_fdatasync(fd) != 0)
 				msg = _("could not fdatasync file \"%s\": %m");
 			break;
+#endif
+#ifdef USE_LIBPMEM
+		case SYNC_METHOD_PMEM_DRAIN:
+			PmemFileSync();
+			break;
 #endif
 		case SYNC_METHOD_OPEN:
 		case SYNC_METHOD_OPEN_DSYNC:
@@ -10612,6 +10775,16 @@ issue_xlog_fsync(int fd, XLogSegNo segno)
 	pgstat_report_wait_end();
 }
 
+int
+xlog_fsync(int fd, void *addr)
+{
+	if (!addr)
+		return pg_fsync(fd);
+
+	PmemFileSync();
+	return 0;
+}
+
 /*
  * do_pg_start_backup
  *
@@ -12048,7 +12221,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 	 * See if we need to switch to a new segment because the requested record
 	 * is not in the currently open one.
 	 */
-	if (readFile >= 0 &&
+	if ((readFile >= 0 || mappedReadFileAddr != NULL) &&
 		!XLByteInSeg(targetPagePtr, readSegNo, wal_segment_size))
 	{
 		/*
@@ -12065,7 +12238,8 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 			}
 		}
 
-		close(readFile);
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+		mappedReadFileAddr = NULL;
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
 	}
@@ -12074,7 +12248,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 
 retry:
 	/* See if we need to retrieve more data */
-	if (readFile < 0 ||
+	if ((readFile < 0 && mappedReadFileAddr == NULL) ||
 		(readSource == XLOG_FROM_STREAM &&
 		 flushedUpto < targetPagePtr + reqLen))
 	{
@@ -12083,8 +12257,9 @@ retry:
 										 private->fetching_ckpt,
 										 targetRecPtr))
 		{
-			if (readFile >= 0)
-				close(readFile);
+			if (readFile >= 0 || mappedReadFileAddr != NULL)
+				do_XLogFileClose(readFile, mappedReadFileAddr);
+			mappedReadFileAddr = NULL;
 			readFile = -1;
 			readLen = 0;
 			readSource = XLOG_FROM_ANY;
@@ -12097,7 +12272,7 @@ retry:
 	 * At this point, we have the right segment open and if we're streaming we
 	 * know the requested record is in it.
 	 */
-	Assert(readFile != -1);
+	Assert(readFile != -1 || mappedReadFileAddr != NULL);
 
 	/*
 	 * If the current segment is being streamed from the primary, calculate how
@@ -12120,28 +12295,33 @@ retry:
 	readOff = targetPageOff;
 
 	pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
-	r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
-	if (r != XLOG_BLCKSZ)
+	if (mappedReadFileAddr)
+		PmemFileRead((char *) mappedReadFileAddr + readOff, readBuf, XLOG_BLCKSZ);
+	else
 	{
-		char		fname[MAXFNAMELEN];
-		int			save_errno = errno;
-
-		pgstat_report_wait_end();
-		XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
-		if (r < 0)
+		r = pg_pread(readFile, readBuf, XLOG_BLCKSZ, (off_t) readOff);
+		if (r != XLOG_BLCKSZ)
 		{
-			errno = save_errno;
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode_for_file_access(),
-					 errmsg("could not read from log segment %s, offset %u: %m",
-							fname, readOff)));
+			char		fname[MAXFNAMELEN];
+			int			save_errno = errno;
+
+			pgstat_report_wait_end();
+			XLogFileName(fname, curFileTLI, readSegNo, wal_segment_size);
+			if (r < 0)
+			{
+				errno = save_errno;
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode_for_file_access(),
+						 errmsg("could not read from log segment %s, offset %u: %m",
+								fname, readOff)));
+			}
+			else
+				ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
+						(errcode(ERRCODE_DATA_CORRUPTED),
+						 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
+								fname, readOff, r, (Size) XLOG_BLCKSZ)));
+			goto next_record_is_invalid;
 		}
-		else
-			ereport(emode_for_corrupt_record(emode, targetPagePtr + reqLen),
-					(errcode(ERRCODE_DATA_CORRUPTED),
-					 errmsg("could not read from log segment %s, offset %u: read %d of %zu",
-							fname, readOff, r, (Size) XLOG_BLCKSZ)));
-		goto next_record_is_invalid;
 	}
 	pgstat_report_wait_end();
 
@@ -12189,8 +12369,9 @@ retry:
 next_record_is_invalid:
 	lastSourceFailed = true;
 
-	if (readFile >= 0)
-		close(readFile);
+	if (readFile >= 0 || mappedReadFileAddr != NULL)
+		do_XLogFileClose(readFile, mappedReadFileAddr);
+	mappedReadFileAddr = NULL;
 	readFile = -1;
 	readLen = 0;
 	readSource = XLOG_FROM_ANY;
@@ -12430,9 +12611,11 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				Assert(!WalRcvStreaming());
 
 				/* Close any old file we might have open. */
-				if (readFile >= 0)
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 				{
-					close(readFile);
+					do_XLogFileClose(readFile,
+									 mappedReadFileAddr);
+					mappedReadFileAddr = NULL;
 					readFile = -1;
 				}
 				/* Reset curFileTLI if random fetch. */
@@ -12445,8 +12628,8 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 				 */
 				readFile = XLogFileReadAnyTLI(readSegNo, DEBUG2,
 											  currentSource == XLOG_FROM_ARCHIVE ? XLOG_FROM_ANY :
-											  currentSource);
-				if (readFile >= 0)
+											  currentSource, &mappedReadFileAddr);
+				if (readFile >= 0 || mappedReadFileAddr != NULL)
 					return true;	/* success! */
 
 				/*
@@ -12580,14 +12763,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						 * info is set correctly and XLogReceiptTime isn't
 						 * changed.
 						 */
-						if (readFile < 0)
+						if (readFile < 0 && mappedReadFileAddr == NULL)
 						{
 							if (!expectedTLEs)
 								expectedTLEs = readTimeLineHistory(receiveTLI);
 							readFile = XLogFileRead(readSegNo, PANIC,
 													receiveTLI,
-													XLOG_FROM_STREAM, false);
-							Assert(readFile >= 0);
+													XLOG_FROM_STREAM, false, &mappedReadFileAddr);
+							Assert(readFile >= 0 || mappedReadFileAddr != NULL);
 						}
 						else
 						{
diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index 5e1291bf2d..462c71bb03 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -17,6 +17,7 @@ OBJS = \
 	copydir.o \
 	fd.o \
 	reinit.o \
-	sharedfileset.o
+	sharedfileset.o \
+	pmem.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/file/fd.c b/src/backend/storage/file/fd.c
index b58502837a..6ca74eadff 100644
--- a/src/backend/storage/file/fd.c
+++ b/src/backend/storage/file/fd.c
@@ -96,6 +96,7 @@
 #include "portability/mem.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/pmem.h"
 #include "utils/guc.h"
 #include "utils/resowner_private.h"
 
@@ -233,6 +234,9 @@ static uint64 temporary_files_size = 0;
 typedef enum
 {
 	AllocateDescFile,
+#ifdef USE_LIBPMEM
+	AllocateDescMap,
+#endif
 	AllocateDescPipe,
 	AllocateDescDir,
 	AllocateDescRawFD
@@ -247,6 +251,10 @@ typedef struct
 		FILE	   *file;
 		DIR		   *dir;
 		int			fd;
+#ifdef USE_LIBPMEM
+		size_t		fsize;
+		void	   *addr;
+#endif
 	}			desc;
 } AllocateDesc;
 
@@ -1724,6 +1732,78 @@ OpenTemporaryFileInTablespace(Oid tblspcOid, bool rejectError)
 	return file;
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Mmap a file with MapTransientFilePerm() and pass default file mode for
+ * the fileMode parameter.
+ */
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return MapTransientFilePerm(fileName, fileFlags, PG_FILE_MODE_DEFAULT,
+								fsize, addr);
+}
+
+/*
+ * Like OpenTransientFilePerm, but memory-maps the file and returns the
+ * mapped address through *addr, much as mmap(2) would.
+ */
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	int			fd;
+
+	DO_DB(elog(LOG, "MapTransientFilePerm: Allocated %d (%s)",
+			   numAllocatedDescs, fileName));
+
+	/* Can we allocate another non-virtual FD? */
+	if (!reserveAllocatedDesc())
+		ereport(ERROR,
+				(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
+				 errmsg("exceeded maxAllocatedDescs (%d) while trying to open file \"%s\"",
+						maxAllocatedDescs, fileName)));
+
+	/* Close excess kernel FDs. */
+	ReleaseLruFiles();
+
+	if (addr != NULL)
+	{
+		void	   *ret_addr = NULL;
+
+		fd = PmemFileOpenPerm(fileName, fileFlags, fileMode, fsize, &ret_addr);
+		if (ret_addr != NULL)
+		{
+			AllocateDesc *desc = &allocatedDescs[numAllocatedDescs];
+
+			*addr = ret_addr;
+
+			desc->kind = AllocateDescMap;
+			desc->desc.addr = ret_addr;
+			desc->desc.fsize = fsize;
+			desc->create_subid = GetCurrentSubTransactionId();
+			numAllocatedDescs++;
+
+			return fd;
+		}
+	}
+
+	return -1;					/* failure */
+}
+#else
+int
+MapTransientFile(const char *fileName, int fileFlags, size_t fsize, void **addr)
+{
+	return -1;
+}
+
+int
+MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr)
+{
+	return -1;
+}
+#endif
 
 /*
  * Create a new file.  The directory containing it must already exist.  Files
@@ -2530,6 +2610,11 @@ FreeDesc(AllocateDesc *desc)
 		case AllocateDescRawFD:
 			result = close(desc->desc.fd);
 			break;
+#ifdef USE_LIBPMEM
+		case AllocateDescMap:
+			result = PmemFileClose(desc->desc.addr, desc->desc.fsize);
+			break;
+#endif
 		default:
 			elog(ERROR, "AllocateDesc kind not recognized");
 			result = 0;			/* keep compiler quiet */
@@ -2571,6 +2656,42 @@ FreeFile(FILE *file)
 	return fclose(file);
 }
 
+#ifdef USE_LIBPMEM
+/*
+ * Unmap a file returned by MapTransientFile.
+ *
+ * Note we do not check unmap's return value --- it is up to the caller
+ * to handle unmap errors.
+ */
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	int			i;
+
+	DO_DB(elog(LOG, "UnmapTransientFile: Allocated %d", numAllocatedDescs));
+
+	/* Remove mapping from list of allocated descriptors, if it's present */
+	for (i = numAllocatedDescs; --i >= 0;)
+	{
+		AllocateDesc *desc = &allocatedDescs[i];
+
+		if (desc->kind == AllocateDescMap && desc->desc.addr == addr)
+			return FreeDesc(desc);
+	}
+
+	/* Only get here if someone passes us an address not in allocatedDescs */
+	elog(WARNING, "addr passed to UnmapTransientFile was not obtained from MapTransientFile");
+
+	return PmemFileClose(addr, fsize);
+}
+#else
+int
+UnmapTransientFile(void *addr, size_t fsize)
+{
+	return -1;
+}
+#endif
+
 /*
  * Close a file returned by OpenTransientFile.
  *
diff --git a/src/backend/storage/file/pmem.c b/src/backend/storage/file/pmem.c
new file mode 100644
index 0000000000..b214b6b18e
--- /dev/null
+++ b/src/backend/storage/file/pmem.c
@@ -0,0 +1,188 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.c
+ *	  Memory-mapped file access for persistent memory.
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	  src/backend/storage/file/pmem.c
+ *
+ * NOTES:
+ *
+ * This code manages a memory-mapped file on a DAX-mounted filesystem backed
+ * by a persistent memory device, using the Persistent Memory Development Kit
+ * (http://pmem.io/pmdk/).
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "storage/pmem.h"
+#include "storage/fd.h"
+
+#ifdef USE_LIBPMEM
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <unistd.h>
+#include <fcntl.h>
+#include <libpmem.h>
+#include <sys/mman.h>
+#include <string.h>
+
+#define PmemFileSize 32
+
+/*
+ * Returns true only if the file is stored on persistent memory.
+ */
+bool
+CheckPmem(const char *path)
+{
+	int			is_pmem = 0;	/* false */
+	size_t		mapped_len = 0;
+	bool		ret = true;
+	void	   *tmpaddr;
+
+	/*
+	 * is_pmem stays 0 if the file at path isn't stored on persistent
+	 * memory.
+	 */
+	tmpaddr = pmem_map_file(path, PmemFileSize, PMEM_FILE_CREATE,
+							PG_FILE_MODE_DEFAULT, &mapped_len, &is_pmem);
+
+	if (tmpaddr)
+	{
+		pmem_unmap(tmpaddr, mapped_len);
+		unlink(path);
+	}
+
+	if (is_pmem)
+		elog(LOG, "%s is stored on persistent memory.", path);
+	else
+		ret = false;
+
+	return ret;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return PmemFileOpenPerm(pathname, flags, PG_FILE_MODE_DEFAULT, fsize, addr);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	int			mapped_flag = 0;
+	size_t		mapped_len = 0;
+	size_t		size = 0;
+	void	   *ret_addr;
+
+	if (addr == NULL)
+		return BasicOpenFilePerm(pathname, flags, mode);
+
+	/* non-zero 'len' not allowed without PMEM_FILE_CREATE */
+	if (flags & O_CREAT)
+	{
+		mapped_flag = PMEM_FILE_CREATE;
+		size = fsize;
+	}
+
+	if (flags & O_EXCL)
+		mapped_flag |= PMEM_FILE_EXCL;
+
+	ret_addr = pmem_map_file(pathname, size, mapped_flag, mode, &mapped_len,
+							 NULL);
+
+	if (fsize != mapped_len)
+	{
+		if (ret_addr != NULL)
+			pmem_unmap(ret_addr, mapped_len);
+
+		return -1;
+	}
+
+	if (mapped_flag & PMEM_FILE_CREATE)
+		if (msync(ret_addr, mapped_len, MS_SYNC))
+			ereport(PANIC,
+					(errcode_for_file_access(),
+					 errmsg("could not msync log file \"%s\": %m", pathname)));
+
+	*addr = ret_addr;
+
+	return NO_FD_FOR_MAPPED_FILE;
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	pmem_memcpy_nodrain((void *) dest, src, len);
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	memcpy(buf, (void *) map_addr, len);
+}
+
+void
+PmemFileSync(void)
+{
+	pmem_drain();
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	return pmem_unmap((void *) addr, fsize);
+}
+
+
+#else
+bool
+CheckPmem(const char *path)
+{
+	return true;
+}
+
+int
+PmemFileOpen(const char *pathname, int flags, size_t fsize, void **addr)
+{
+	return BasicOpenFile(pathname, flags);
+}
+
+int
+PmemFileOpenPerm(const char *pathname, int flags, int mode, size_t fsize,
+				 void **addr)
+{
+	return BasicOpenFilePerm(pathname, flags, mode);
+}
+
+void
+PmemFileWrite(void *dest, void *src, size_t len)
+{
+	ereport(PANIC, (errmsg("this build does not support persistent memory (libpmem)")));
+}
+
+void
+PmemFileRead(void *map_addr, void *buf, size_t len)
+{
+	ereport(PANIC, (errmsg("this build does not support persistent memory (libpmem)")));
+}
+
+void
+PmemFileSync(void)
+{
+	ereport(PANIC, (errmsg("this build does not support persistent memory (libpmem)")));
+}
+
+int
+PmemFileClose(void *addr, size_t fsize)
+{
+	ereport(PANIC, (errmsg("this build does not support persistent memory (libpmem)")));
+	return -1;
+}
+#endif
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 17579eeaca..e938464113 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -4738,7 +4738,7 @@ static struct config_enum ConfigureNamesEnum[] =
 		},
 		&sync_method,
 		DEFAULT_SYNC_METHOD, sync_method_options,
-		NULL, assign_xlog_sync_method, NULL
+		check_xlog_sync_method, assign_xlog_sync_method, NULL
 	},
 
 	{
diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
index 8930a94fff..5a28683b8e 100644
--- a/src/backend/utils/misc/postgresql.conf.sample
+++ b/src/backend/utils/misc/postgresql.conf.sample
@@ -210,6 +210,7 @@
 					#   fsync
 					#   fsync_writethrough
 					#   open_sync
+					#   pmem_drain
 #full_page_writes = on			# recover from partial page writes
 #wal_compression = off			# enable compression of full-page writes
 #wal_log_hints = off			# also do full page writes of non-critical updates
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 75ec1073bd..282a7a8c18 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -27,6 +27,7 @@
 #define SYNC_METHOD_OPEN		2	/* for O_SYNC */
 #define SYNC_METHOD_FSYNC_WRITETHROUGH	3
 #define SYNC_METHOD_OPEN_DSYNC	4	/* for O_DSYNC */
+#define SYNC_METHOD_PMEM_DRAIN	5	/* for Persistent Memory Development Kit */
 extern int	sync_method;
 
 extern PGDLLIMPORT TimeLineID ThisTimeLineID;	/* current TLI */
@@ -287,8 +288,10 @@ extern XLogRecPtr XLogInsertRecord(struct XLogRecData *rdata,
 extern void XLogFlush(XLogRecPtr RecPtr);
 extern bool XLogBackgroundFlush(void);
 extern bool XLogNeedsFlush(XLogRecPtr RecPtr);
-extern int	XLogFileInit(XLogSegNo segno, bool *use_existent, bool use_lock);
-extern int	XLogFileOpen(XLogSegNo segno);
+extern int XLogFileInit(XLogSegNo logsegno, bool *use_existent, bool use_lock,
+						void **addr);
+extern int	XLogFileOpen(XLogSegNo segno, void **addr);
+extern int	do_XLogFileClose(int fd, void *addr);
 
 extern void CheckXLogRemoved(XLogSegNo segno, TimeLineID tli);
 extern XLogSegNo XLogGetLastRemovedSegno(void);
@@ -300,6 +303,7 @@ extern void xlog_desc(StringInfo buf, XLogReaderState *record);
 extern const char *xlog_identify(uint8 info);
 
 extern void issue_xlog_fsync(int fd, XLogSegNo segno);
+extern int	xlog_fsync(int fd, void *addr);
 
 extern bool RecoveryInProgress(void);
 extern RecoveryState GetRecoveryState(void);
diff --git a/src/include/storage/fd.h b/src/include/storage/fd.h
index 30bf7d2193..385b75aabd 100644
--- a/src/include/storage/fd.h
+++ b/src/include/storage/fd.h
@@ -50,6 +50,12 @@ struct iovec;					/* avoid including port/pg_iovec.h here */
 typedef int File;
 
 
+/*
+ * Default mode for created files, unless something else is specified using
+ * the *Perm() function variants.
+ */
+#define PG_FILE_MODE_DEFAULT	(S_IRUSR | S_IWUSR)
+
 /* GUC parameter */
 extern PGDLLIMPORT int max_files_per_process;
 extern PGDLLIMPORT bool data_sync_retry;
@@ -121,6 +127,13 @@ extern int	OpenTransientFile(const char *fileName, int fileFlags);
 extern int	OpenTransientFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
 extern int	CloseTransientFile(int fd);
 
+/* Operations to allow use of a memory-mapped file */
+extern int MapTransientFile(const char *fileName, int fileFlags, size_t fsize,
+				 void **addr);
+extern int MapTransientFilePerm(const char *fileName, int fileFlags, int fileMode,
+					 size_t fsize, void **addr);
+extern int	UnmapTransientFile(void *addr, size_t fsize);
+
 /* If you've really really gotta have a plain kernel FD, use this */
 extern int	BasicOpenFile(const char *fileName, int fileFlags);
 extern int	BasicOpenFilePerm(const char *fileName, int fileFlags, mode_t fileMode);
diff --git a/src/include/storage/pmem.h b/src/include/storage/pmem.h
new file mode 100644
index 0000000000..b9b9156c91
--- /dev/null
+++ b/src/include/storage/pmem.h
@@ -0,0 +1,32 @@
+/*-------------------------------------------------------------------------
+ *
+ * pmem.h
+ *		Virtual file descriptor definitions for persistent memory.
+ *
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/pmem.h
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#ifndef PMEM_H
+#define PMEM_H
+
+#include "postgres.h"
+
+#define NO_FD_FOR_MAPPED_FILE -2
+
+extern bool CheckPmem(const char *path);
+extern int PmemFileOpen(const char *pathname, int flags, size_t fsize,
+			 void **addr);
+extern int PmemFileOpenPerm(const char *pathname, int flags, int mode,
+				 size_t fsize, void **addr);
+extern void PmemFileWrite(void *dest, void *src, size_t len);
+extern void PmemFileRead(void *map_addr, void *buf, size_t len);
+extern void PmemFileSync(void);
+extern int	PmemFileClose(void *addr, size_t fsize);
+
+#endif							/* PMEM_H */
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 5004ee4177..60ebf69ee2 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -440,6 +440,7 @@ extern void assign_search_path(const char *newval, void *extra);
 
 /* in access/transam/xlog.c */
 extern bool check_wal_buffers(int *newval, void **extra, GucSource source);
+extern bool check_xlog_sync_method(int *newval, void **extra, GucSource source);
 extern void assign_xlog_sync_method(int new_sync_method, void *extra);
 
 #endif							/* GUC_H */
-- 
2.25.1
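
For readers without a PMEM device, the access pattern that the PmemFile* wrappers in pmem.c implement (map a segment, copy into the mapping, flush once, copy back out, unmap) can be tried with plain POSIX calls. The following is only a rough sketch under stated assumptions and is not part of the patch: every demo_* name and DEMO_SEG_SIZE are hypothetical, memcpy() stands in for pmem_memcpy_nodrain(), and msync() stands in for pmem_drain(). On real PMEM the flush would not go through the kernel.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define DEMO_SEG_SIZE 8192		/* hypothetical; real WAL segments are larger */

/* Rough analogue of PmemFileOpenPerm: create and map a fixed-size file. */
static void *
demo_map_segment(const char *path)
{
	void	   *addr;
	int			fd = open(path, O_RDWR | O_CREAT, 0600);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, DEMO_SEG_SIZE) != 0)
	{
		close(fd);
		return NULL;
	}
	addr = mmap(NULL, DEMO_SEG_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	close(fd);					/* the mapping remains valid after close */
	return (addr == MAP_FAILED) ? NULL : addr;
}

/* Rough analogue of PmemFileWrite: store into the mapping, no flush yet. */
static void
demo_write(void *base, size_t off, const void *src, size_t len)
{
	assert(off + len <= DEMO_SEG_SIZE);
	memcpy((char *) base + off, src, len);
}

/* Rough analogue of PmemFileSync: msync() is an assumption replacing pmem_drain(). */
static int
demo_sync(void *base)
{
	return msync(base, DEMO_SEG_SIZE, MS_SYNC);
}

/* Rough analogue of PmemFileRead: copy out of the mapping. */
static void
demo_read(const void *base, size_t off, void *buf, size_t len)
{
	assert(off + len <= DEMO_SEG_SIZE);
	memcpy(buf, (const char *) base + off, len);
}

/* Rough analogue of PmemFileClose. */
static int
demo_unmap(void *base)
{
	return munmap(base, DEMO_SEG_SIZE);
}

/* Write one page-sized record, flush, read it back; returns 0 on success. */
int
demo_roundtrip(const char *path)
{
	const char	page[] = "hypothetical WAL page";
	char		buf[sizeof(page)];
	int			ok;
	void	   *base = demo_map_segment(path);

	if (base == NULL)
		return -1;

	demo_write(base, 4096, page, sizeof(page));
	ok = (demo_sync(base) == 0);
	demo_read(base, 4096, buf, sizeof(page));
	ok = ok && memcmp(buf, page, sizeof(page)) == 0;
	ok = (demo_unmap(base) == 0) && ok;
	unlink(path);
	return ok ? 0 : -1;
}
```

Calling demo_roundtrip() with a path on any writable filesystem exercises the whole sequence. The point of the split between demo_write() and demo_sync() is the same as in the patch: many copies into the mapping can be followed by a single flush.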