Undo logs

Started by Thomas Munro over 7 years ago, 46 messages
#1 Thomas Munro
thomas.munro@enterprisedb.com
5 attachment(s)

Hello hackers,

As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
proposal to add in-place updates with undo logs to PostgreSQL. The
goal is to improve performance and resource usage by recycling space
better.

The lowest level piece of this work is a physical undo log manager,
which I've personally been working on. Later patches will build on
top, adding record-oriented access and then the main "zheap" access
manager and related infrastructure. My colleagues will write about
those.

The README files[4][5] explain in more detail, but here is a
bullet-point description of what the attached patch set gives you (a
minimal usage sketch follows the list):

1. Efficient appending of new undo data from many concurrent
backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3. Efficient buffered random reading of undo data. Like relations.
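
For a rough idea of how client code drives those three operations, here
is a minimal sketch using the functions from patch 0001 (the wrapper
function and its arguments are illustrative only; writing the undo
contents through the buffer manager, WAL-logging them and error handling
are the caller's job and are omitted):

#include "postgres.h"
#include "access/undolog.h"
#include "access/undolog_xlog.h"

static void
undo_append_then_discard_sketch(size_t record_size, TransactionId xid)
{
	xl_undolog_meta meta;
	UndoRecPtr	start;

	/* (1) Reserve space; the insert point is backed by segment files. */
	start = UndoLogAllocate(record_size, UNDO_PERMANENT, &meta);

	/* ... write the undo record via bufmgr and WAL-log it here ... */

	/* ... then advance the insert pointer past the bytes just used. */
	UndoLogAdvance(start, record_size, UNDO_PERMANENT);

	/*
	 * (2) Later, once the caller promises the data older than 'start' will
	 * never be read again, the discard pointer can be advanced; whole
	 * segment files behind it are unlinked or recycled.
	 */
	if (!UndoLogIsDiscarded(start))
		UndoLogDiscard(start, xid);

	/*
	 * (3) Buffered random reads of undo data go through the regular buffer
	 * manager, using buffer tags derived from an UndoRecPtr (see undofile.c
	 * in the rest of the patch set).
	 */
}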

A test module is provided that can be used to exercise the undo log
code paths without needing any of the later zheap patches.

This is work in progress. A few aspects are under active development
and liable to change, as indicated by comments, and there are no doubt
bugs and room for improvement. The code is also available at
github.com/EnterpriseDB/zheap (these patches are from the
undo-log-storage branch, see also the master branch which has the full
zheap feature). We'd be grateful for any questions, feedback or
ideas.

[1]: https://amitkapila16.blogspot.com/2018/03/zheap-storage-engine-to-provide-better.html
[2]: https://rhaas.blogspot.com/2018/01/do-or-undo-there-is-no-vacuum.html
[3]: https://www.pgcon.org/2018/schedule/events/1190.en.html
[4]: https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5]: https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

--
Thomas Munro
http://www.enterprisedb.com

Attachments:

0001-Add-undo-log-manager-v1.patch (application/octet-stream)
From fd72cc0ab7850a2ee3f546da8222389ce9cecabf Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 25 May 2018 09:43:16 +1200
Subject: [PATCH 1/6] Add undo log manager.

Add a new subsystem to manage undo logs.  Undo logs allow data to be appended
efficiently, like logs.  They also allow data to be discarded efficiently from
the other end, like a queue.  Thirdly, they allow efficient buffered random
access, like a relation.

Undo logs physically consist of a set of 1MB segment files under
$PGDATA/base/undo (or per-tablespace equivalent) that are created, deleted or
renamed as required, similarly to the way that WAL segments are managed.
Meta-data about the set of undo logs is stored in shared memory, and written
to per-checkpoint files under $PGDATA/pg_undo.

This commit provides an API for allocating and discarding undo log storage
space and managing the files in a crash-safe way.  A later commit will provide
support for accessing the data stored inside them.

XXX Status: WIP.  Some details around WAL are being reconsidered, as noted in
comments.

Author: Thomas Munro, with contributions from Dilip Kumar and input from
        Amit Kapila and Robert Haas
Tested-By: Neha Sharma
Reviewed-By:
Discussion:
---
 src/backend/access/Makefile               |    2 +-
 src/backend/access/rmgrdesc/Makefile      |    2 +-
 src/backend/access/rmgrdesc/undologdesc.c |  104 +
 src/backend/access/transam/rmgr.c         |    1 +
 src/backend/access/undo/Makefile          |   17 +
 src/backend/access/undo/undolog.c         | 2633 +++++++++++++++++++++
 src/backend/catalog/system_views.sql      |    4 +
 src/backend/commands/tablespace.c         |   22 +
 src/backend/replication/logical/decode.c  |    1 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/lmgr/lwlock.c         |    2 +
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/utils/misc/guc.c              |   12 +
 src/bin/initdb/initdb.c                   |    2 +
 src/bin/pg_waldump/rmgrdesc.c             |    1 +
 src/include/access/rmgrlist.h             |    1 +
 src/include/access/undolog.h              |  305 +++
 src/include/access/undolog_xlog.h         |   70 +
 src/include/catalog/pg_proc.dat           |    7 +
 src/include/storage/lwlock.h              |    2 +
 src/include/utils/guc.h                   |    2 +
 src/test/regress/expected/rules.out       |    9 +
 22 files changed, 3201 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
 create mode 100644 src/backend/access/undo/Makefile
 create mode 100644 src/backend/access/undo/undolog.c
 create mode 100644 src/include/access/undolog.h
 create mode 100644 src/include/access/undolog_xlog.h

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index bd93a6a8d1e..7f7380c96f0 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  tablesample transam
+			  tablesample transam undo
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1dda6..91ad1ef8a3d 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,6 @@ include $(top_builddir)/src/Makefile.global
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
 	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o undologdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 00000000000..5855b9b49e6
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,104 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ *	  rmgr descriptor routines for access/undo/undolog.c
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_UNDOLOG_CREATE)
+	{
+		xl_undolog_create *xlrec = (xl_undolog_create *) rec;
+
+		appendStringInfo(buf, "logno %u", xlrec->logno);
+	}
+	else if (info == XLOG_UNDOLOG_EXTEND)
+	{
+		xl_undolog_extend *xlrec = (xl_undolog_extend *) rec;
+
+		appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_ATTACH)
+	{
+		xl_undolog_attach *xlrec = (xl_undolog_attach *) rec;
+
+		appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid);
+	}
+	else if (info == XLOG_UNDOLOG_META)
+	{
+		xl_undolog_meta *xlrec = (xl_undolog_meta *) rec;
+
+		appendStringInfo(buf, "logno %u xid %u insert " UndoLogOffsetFormat
+						 " last_xact_start " UndoLogOffsetFormat
+						 " prevlen=%d"
+						 " is_first_record=%d",
+						 xlrec->logno, xlrec->xid, xlrec->meta.insert,
+						 xlrec->meta.last_xact_start,
+						 xlrec->meta.prevlen,
+						 xlrec->meta.is_first_rec);
+	}
+	else if (info == XLOG_UNDOLOG_DISCARD)
+	{
+		xl_undolog_discard *xlrec = (xl_undolog_discard *) rec;
+
+		appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end "
+						 UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->discard, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_REWIND)
+	{
+		xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec;
+
+		appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d",
+						 xlrec->logno, xlrec->insert, xlrec->prevlen);
+	}
+
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			id = "CREATE";
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			id = "EXTEND";
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			id = "ATTACH";
+			break;
+		case XLOG_UNDOLOG_META:
+			id = "UNDO_META";
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			id = "DISCARD";
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			id = "REWIND";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56c4ce..8b0537405a9 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -18,6 +18,7 @@
 #include "access/multixact.h"
 #include "access/nbtxlog.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
new file mode 100644
index 00000000000..219c6963cf8
--- /dev/null
+++ b/src/backend/access/undo/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/undo
+#
+# IDENTIFICATION
+#    src/backend/access/undo/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/undo
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = undolog.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
new file mode 100644
index 00000000000..6be4bb4fdb9
--- /dev/null
+++ b/src/backend/access/undo/undolog.c
@@ -0,0 +1,2633 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ *	  management of undo logs
+ *
+ * PostgreSQL undo log manager.  This module is responsible for managing the
+ * lifecycle of undo logs and their segment files, associating undo logs with
+ * backends, and allocating space within undo logs.
+ *
+ * For the code that reads and writes blocks of data, see undofile.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogreader.h"
+#include "catalog/catalog.h"
+#include "catalog/pg_tablespace.h"
+#include "commands/tablespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "pgstat.h"
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+#include "storage/dsm.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/standby.h"
+#include "storage/undofile.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/varlena.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+/* End-of-list value when building linked lists of undo logs. */
+#define InvalidUndoLogNumber -1
+
+/*
+ * Number of bits of an undo log number used to identify a bank of
+ * UndoLogControl objects.  This allows us to break up our array of
+ * UndoLogControl objects into many smaller arrays, called banks, and find our
+ * way to an UndoLogControl object in O(1) complexity in two steps.
+ */
+#define UndoLogBankBits 14
+#define UndoLogBanks (1 << UndoLogBankBits)
+
+/* Extract the undo bank number from an undo log number (upper bits). */
+#define UndoLogNoGetBankNo(logno)				\
+	((logno) >> (UndoLogNumberBits - UndoLogBankBits))
+
+/* Extract the slot within a bank from an undo log number (lower bits). */
+#define UndoLogNoGetSlotNo(logno)				\
+	((logno) & ((1 << (UndoLogNumberBits - UndoLogBankBits)) - 1))
+
+/*
+ * During recovery we maintain a mapping of transaction ID to undo logs
+ * numbers.  We do this with another two-level array, so that we use memory
+ * only for chunks of the array that overlap with the range of active xids.
+ */
+#define UndoLogXidLowBits 16
+
+/*
+ * Number of high bits.
+ */
+#define UndoLogXidHighBits \
+	(sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits)
+
+/* Extract the upper bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits)
+
+/* Extract the lower bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1))
+
+/* What is the offset of the i'th non-header byte? */
+#define UndoLogOffsetFromUsableByteNo(i)								\
+	(((i) / UndoLogUsableBytesPerPage) * BLCKSZ +						\
+	 UndoLogBlockHeaderSize +											\
+	 ((i) % UndoLogUsableBytesPerPage))
+
+/* How many non-header bytes are there before a given offset? */
+#define UndoLogOffsetToUsableByteNo(offset)				\
+	(((offset) % BLCKSZ - UndoLogBlockHeaderSize) +		\
+	 ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage)
+
+/* Add 'n' usable bytes to offset stepping over headers to find new offset. */
+#define UndoLogOffsetPlusUsableBytes(offset, n)							\
+	UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n))
+
+/*
+ * Main control structure for undo log management in shared memory.
+ */
+typedef struct UndoLogSharedData
+{
+	UndoLogNumber free_lists[UndoPersistenceLevels];
+	int low_bankno; /* the lowest bank */
+	int high_bankno; /* one past the highest bank */
+	UndoLogNumber low_logno; /* the lowest logno */
+	UndoLogNumber high_logno; /* one past the highest logno */
+
+	/*
+	 * Array of DSM handles pointing to the arrays of UndoLogControl objects.
+	 * We don't expect there to be many banks active at a time -- usually 1 or
+	 * 2, but we need random access by log number so we arrange them into
+	 * 'banks'.
+	 */
+	dsm_handle banks[UndoLogBanks];
+} UndoLogSharedData;
+
+/*
+ * Per-backend state for the undo log module.
+ * Backend-local pointers to undo subsystem state in shared memory.
+ */
+struct
+{
+	UndoLogSharedData *shared;
+
+	/*
+	 * The control object for the undo logs that this backend is currently
+	 * attached to at each persistence level.
+	 */
+	UndoLogControl *logs[UndoPersistenceLevels];
+
+	/* The DSM segments used to hold banks of control objects. */
+	dsm_segment *bank_segments[UndoLogBanks];
+
+	/*
+	 * The address where each bank of control objects is mapped into memory in
+	 * this backend.  We map banks into memory on demand, and (for now) they
+	 * stay mapped in until every backend that mapped them exits.
+	 */
+	UndoLogControl *banks[UndoLogBanks];
+
+	/*
+	 * The lowest log number that might currently be mapped into this backend.
+	 */
+	int				low_logno;
+
+	/*
+	 * If the undo_tablespaces GUC changes we'll remember to examine it and
+	 * attach to a new undo log using this flag.
+	 */
+	bool			need_to_choose_tablespace;
+
+	/*
+	 * During recovery, the startup process maintains a mapping of xid to undo
+	 * log number, instead of using 'log' above.  This is not used in regular
+	 * backends and can be in backend-private memory so long as recovery is
+	 * single-process.  This map references UNDO_PERMANENT logs only, since
+	 * temporary and unlogged relations don't have WAL to replay.
+	 */
+	UndoLogNumber **xid_map;
+
+	/*
+	 * The slot for the oldest xids still running.  We advance this during
+	 * checkpoints to free up chunks of the map.
+	 */
+	uint16			xid_map_oldest_chunk;
+} MyUndoLogState;
+
+/* GUC variables */
+char	   *undo_tablespaces = NULL;
+
+static UndoLogControl *get_undo_log_by_number(UndoLogNumber logno);
+static UndoLogControl *get_undo_log_by_number_unlocked(UndoLogNumber logno);
+static void ensure_undo_log_number(UndoLogNumber logno);
+static void attach_undo_log(UndoPersistence level, Oid tablespace);
+static void detach_current_undo_log(UndoPersistence level, bool exhausted);
+static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end);
+static void undo_log_before_exit(int code, Datum value);
+static void forget_undo_buffers(int logno, UndoLogOffset old_discard,
+								UndoLogOffset new_discard,
+								bool drop_tail);
+static bool choose_undo_tablespace(bool force_detach, Oid *oid);
+static void undolog_xid_map_gc(void);
+static void undolog_bank_gc(void);
+
+PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs);
+
+/*
+ * Return the amount of traditional shmem required for undo log management.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)
+{
+	return sizeof(UndoLogSharedData);
+}
+
+/*
+ * Initialize the undo log subsystem.  Called in each backend.
+ */
+void
+UndoLogShmemInit(void)
+{
+	bool found;
+
+	MyUndoLogState.shared = (UndoLogSharedData *)
+		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
+
+	if (!IsUnderPostmaster)
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		Assert(!found);
+
+		/*
+		 * We start with no undo logs.  StartupUndoLogs() will recreate undo
+		 * logs that were known at last checkpoint.
+		 */
+		memset(shared, 0, sizeof(*shared));
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+			shared->free_lists[i] = InvalidUndoLogNumber;
+		shared->low_bankno = 0;
+		shared->high_bankno = 0;
+	}
+	else
+		Assert(found);
+}
+
+void
+UndoLogInit(void)
+{
+	before_shmem_exit(undo_log_before_exit, 0);
+}
+
+/*
+ * Figure out which directory holds an undo log based on tablespace.
+ */
+static void
+UndoLogDirectory(Oid tablespace, char *dir)
+{
+	if (tablespace == DEFAULTTABLESPACE_OID ||
+		tablespace == InvalidOid)
+		snprintf(dir, MAXPGPATH, "base/undo");
+	else
+		snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo",
+				 tablespace, TABLESPACE_VERSION_DIRECTORY);
+}
+
+/*
+ * Compute the pathname to use for an undo log segment file.
+ */
+void
+UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path)
+{
+	char		dir[MAXPGPATH];
+
+	/* Figure out which directory holds the segment, based on tablespace. */
+	UndoLogDirectory(tablespace, dir);
+
+	/*
+	 * Build the path from log number and offset.  The pathname is the
+	 * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+	 * period inserted between the components.
+	 */
+	snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+			 segno * UndoLogSegmentSize);
+}
+
+/*
+ * Iterate through the set of currently active logs.  Pass in NULL to get the
+ * first undo log.  NULL indicates the end of the set of logs.
+ */
+UndoLogControl *
+UndoLogNext(UndoLogControl *log)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogNumber logno;
+	UndoLogNumber high_logno;
+
+	if (log == NULL)
+		return get_undo_log_by_number(InvalidUndoLogNumber);
+
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+	high_logno = shared->high_logno;
+	LWLockRelease(UndoLogLock);
+
+	/*
+	 * It's possible that individual logs have been entirely discarded, so we
+	 * have to be ready to skip NULLs.
+	 */
+	for (logno = log->logno + 1; logno < high_logno; ++logno)
+	{
+		log = get_undo_log_by_number(logno);
+		if (log != NULL)
+			return log;
+	}
+
+	return NULL;
+}
+
+/*
+ * Check if an undo log position has been discarded.  'point' must be an undo
+ * log pointer that was allocated at some point in the past, otherwise the
+ * result is undefined.
+ */
+bool
+UndoLogIsDiscarded(UndoRecPtr point)
+{
+	UndoLogControl *log = get_undo_log_by_number(UndoRecPtrGetLogNo(point));
+	bool	result;
+
+	/*
+	 * If we don't recognize the log number, it's either entirely discarded or
+	 * it's never been allocated (ie from the future) and our result is
+	 * undefined.
+	 */
+	if (log == NULL)
+		return true;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	result = UndoRecPtrGetOffset(point) < log->meta.discard;
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Store the latest transaction's start undo record pointer in the undo
+ * meta-data.  It will be fetched by the backend when it's reusing the undo
+ * log and preparing its first undo record.
+ */
+void
+UndoLogSetLastXactStartPoint(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log = get_undo_log_by_number(logno);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.last_xact_start = UndoRecPtrGetOffset(point);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Fetch the previous transaction's start undo record point.
+ */
+UndoRecPtr
+UndoLogGetLastXactStartPoint(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log_by_number(logno);
+	uint64 last_xact_start = 0;
+
+	if (unlikely(log == NULL))
+		return InvalidUndoRecPtr;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	last_xact_start = log->meta.last_xact_start;
+	LWLockRelease(&log->mutex);
+
+	if (last_xact_start == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, last_xact_start);
+}
+
+/*
+ * Store the last undo record's length in the undo meta-data so that it
+ * persists across restarts.
+ */
+void
+UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen)
+{
+	UndoLogControl *log = get_undo_log_by_number(logno);
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.prevlen = prevlen;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get the last undo record's length.
+ */
+uint16
+UndoLogGetPrevLen(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log_by_number(logno);
+	uint16	prevlen;
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	return prevlen;
+}
+
+/*
+ * Check whether this is the first undo record of the given transaction.
+ */
+bool
+IsTransactionFirstRec(TransactionId xid)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	Assert(InRecovery);
+
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	log = get_undo_log_by_number(logno);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	return log->meta.is_first_rec;
+}
+
+/*
+ * Detach from the undo log we are currently attached to, returning it to the
+ * appropriate free list if it still has space.
+ */
+static void
+detach_current_undo_log(UndoPersistence persistence, bool exhausted)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+
+	MyUndoLogState.logs[persistence] = NULL;
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = InvalidPid;
+	log->xid = InvalidTransactionId;
+	if (exhausted)
+		log->meta.status = UNDO_LOG_STATUS_EXHAUSTED;
+	LWLockRelease(&log->mutex);
+
+	/* Push back onto the appropriate free list. */
+	if (!exhausted)
+	{
+		LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+		log->next_free = shared->free_lists[persistence];
+		shared->free_lists[persistence] = log->logno;
+		LWLockRelease(UndoLogLock);
+	}
+}
+
+/*
+ * Exit handler, detaching from all undo logs.
+ */
+static void
+undo_log_before_exit(int code, Datum arg)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		if (MyUndoLogState.logs[i] != NULL)
+			detach_current_undo_log(i, false);
+	}
+}
+
+/*
+ * Create a new empty segment file on disk for the byte starting at 'end'.
+ */
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+							UndoLogOffset end)
+{
+	struct stat	stat_buffer;
+	off_t	size;
+	char	path[MAXPGPATH];
+	void   *zeroes;
+	size_t	nzeroes = 8192;
+	int		fd;
+
+	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
+
+	/*
+	 * Create and fully allocate a new file.  If we crashed and recovered
+	 * then the file might already exist, so use flags that tolerate that.
+	 * It's also possible that it exists but is too short, in which case
+	 * we'll write the rest.  We don't really care what's in the file, we
+	 * just want to make sure that the filesystem has allocated physical
+	 * blocks for it, so that non-COW filesystems will report ENOSPC now
+	 * rather than later when the space is needed and we'll avoid creating
+	 * files with holes.
+	 */
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0 && tablespace != 0)
+	{
+		char undo_path[MAXPGPATH];
+
+		/* Try creating the undo directory for this tablespace. */
+		UndoLogDirectory(tablespace, undo_path);
+		if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+		{
+			char	   *parentdir;
+
+			if (errno != ENOENT || !InRecovery)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+
+			/*
+			 * In recovery, it's possible that the tablespace directory
+			 * doesn't exist because a later WAL record removed the whole
+			 * tablespace.  In that case we create a regular directory to
+			 * stand in for it.  This is similar to the logic in
+			 * TablespaceCreateDbspace().
+			 */
+
+			/* create two parents up if not exist */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			/* create one parent up if not exist */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+		}
+
+		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	}
+	if (fd < 0)
+		elog(ERROR, "could not create new file \"%s\": %m", path);
+	if (fstat(fd, &stat_buffer) < 0)
+		elog(ERROR, "could not stat \"%s\": %m", path);
+	size = stat_buffer.st_size;
+
+	/* A buffer full of zeroes we'll use to fill up new segment files. */
+	zeroes = palloc0(nzeroes);
+
+	while (size < UndoLogSegmentSize)
+	{
+		ssize_t written;
+
+		written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size));
+		if (written < 0)
+			elog(ERROR, "cannot initialize undo log segment file \"%s\": %m",
+				 path);
+		size += written;
+	}
+
+	/* Flush the contents of the file to disk. */
+	if (pg_fsync(fd) != 0)
+		elog(ERROR, "cannot fsync file \"%s\": %m", path);
+	CloseTransientFile(fd);
+
+	pfree(zeroes);
+
+	elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */
+}
+
+/*
+ * Create a new undo segment, when it is unexpectedly not present.
+ */
+void
+UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno)
+{
+	Assert(InRecovery);
+	allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize);
+}
+
+/*
+ * Create and zero-fill a new segment for a given undo log number.
+ */
+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+	UndoLogControl *log;
+	char		dir[MAXPGPATH];
+	size_t		end;
+
+	log = get_undo_log_by_number(logno);
+
+	Assert(log != NULL);
+	Assert(log->meta.end % UndoLogSegmentSize == 0);
+	Assert(new_end % UndoLogSegmentSize == 0);
+	Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery);
+
+	/*
+	 * Create all the segments needed to increase 'end' to the requested
+	 * size.  This is quite expensive, so we will try to avoid it completely
+	 * by renaming files into place in UndoLogDiscard instead.
+	 */
+	end = log->meta.end;
+	while (end < new_end)
+	{
+		allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	Assert(end == new_end);
+
+	/*
+	 * Flush the parent dir so that the directory metadata survives a crash
+	 * after this point.
+	 */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/*
+	 * If we're not in recovery, we need to WAL-log the creation of the new
+	 * file(s).  We do that after the above filesystem modifications, in
+	 * violation of the data-before-WAL rule as exempted by
+	 * src/backend/access/transam/README.  This means that it's possible for
+	 * us to crash having made some or all of the filesystem changes but
+	 * before WAL logging, but in that case we'll eventually try to create the
+	 * same segment(s) again, which is tolerated.
+	 */
+	if (!InRecovery)
+	{
+		xl_undolog_extend xlrec;
+		XLogRecPtr	ptr;
+
+		xlrec.logno = logno;
+		xlrec.end = end;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+		XLogFlush(ptr);
+	}
+
+	/*
+	 * We didn't need to acquire the mutex to read 'end' above because only
+	 * we write to it.  But we need the mutex to update it, because the
+	 * checkpointer might read it concurrently.
+	 *
+	 * XXX It's possible for meta.end to be higher already during
+	 * recovery, because of the timing of a checkpoint; in that case we did
+	 * nothing above and we shouldn't update shmem here.  That interaction
+	 * needs more analysis.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (log->meta.end < end)
+		log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get an insertion point that is guaranteed to be backed by enough space to
+ * hold 'size' bytes of data.  To actually write into the undo log, client
+ * code should call this first and then use bufmgr routines to access buffers
+ * and provide WAL logs and redo handlers.  In other words, while this module
+ * looks after making sure the undo log has sufficient space and the undo meta
+ * data is crash safe, the *contents* of the undo log and (indirectly) the
+ * insertion point are the responsibility of client code.
+ *
+ * Return an undo log insertion point that can be converted to a buffer tag
+ * and an insertion point within a buffer page.
+ *
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence, xl_undolog_meta *undometa)
+{
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+	UndoLogOffset new_insert;
+
+	/*
+	 * We may need to attach to an undo log, either because this is the first
+	 * time this backend has needed to write to an undo log at all or because
+	 * the undo_tablespaces GUC was changed.  When doing that, we'll need
+	 * interlocking against tablespaces being concurrently dropped.
+	 */
+
+ retry:
+	/* See if we need to check the undo_tablespaces GUC. */
+	if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL))
+	{
+		Oid		tablespace;
+		bool	need_to_unlock;
+
+		need_to_unlock =
+			choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace,
+								   &tablespace);
+		attach_undo_log(persistence, tablespace);
+		if (need_to_unlock)
+			LWLockRelease(TablespaceCreateLock);
+		log = MyUndoLogState.logs[persistence];
+		MyUndoLogState.need_to_choose_tablespace = false;
+	}
+
+	/*
+	 * If this is the first time we've allocated undo log space in this
+	 * transaction, we'll record the xid->undo log association so that it can
+	 * be replayed correctly. Before that, we set the first record flag to
+	 * false.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.is_first_rec = false;
+
+	if (log->xid != GetTopTransactionId())
+	{
+		xl_undolog_attach xlrec;
+
+		/*
+		 * While we have the lock, check if we have been forcibly detached by
+		 * DROP TABLESPACE.  That can only happen between transactions (see
+		 * DropUndoLogsInTablespace()).
+		 */
+		if (log->pid == InvalidPid)
+		{
+			LWLockRelease(&log->mutex);
+			log = NULL;
+			goto retry;
+		}
+		log->xid = GetTopTransactionId();
+		log->meta.is_first_rec = true;
+		LWLockRelease(&log->mutex);
+
+		/* Skip the attach record for unlogged and temporary tables. */
+		if (persistence == UNDO_PERMANENT)
+		{
+			xlrec.xid = GetTopTransactionId();
+			xlrec.logno = log->logno;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH);
+		}
+	}
+	else
+	{
+		LWLockRelease(&log->mutex);
+	}
+
+	/*
+	 * 'size' is expressed in usable non-header bytes.  Figure out how far we
+	 * have to move insert to create space for 'size' usable bytes, stepping
+	 * over any intervening headers.
+	 */
+	Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize);
+	new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize);
+
+	/*
+	 * We don't need to acquire log->mutex to read log->meta.insert and
+	 * log->meta.end, because this backend is the only one that can
+	 * modify them.
+	 */
+	if (unlikely(new_insert > log->meta.end))
+	{
+		if (new_insert > UndoLogMaxSize)
+		{
+			/* This undo log is entirely full.  Get a new one. */
+			/*
+			 * TODO: do we need to do something more here?  How will the
+			 * caller or later the undo worker deal with a transaction being
+			 * split over two undo logs?
+			 */
+			log = NULL;
+			detach_current_undo_log(persistence, true);
+			goto retry;
+		}
+		/*
+		 * Extend the end of this undo log to cover new_insert (in other words
+		 * round up to the segment size).
+		 */
+		extend_undo_log(log->logno,
+						new_insert + UndoLogSegmentSize -
+						new_insert % UndoLogSegmentSize);
+		Assert(new_insert <= log->meta.end);
+	}
+
+	if (undometa)
+	{
+		undometa->meta = log->meta;
+		undometa->logno = log->logno;
+		undometa->xid = log->xid;
+	}
+
+	return MakeUndoRecPtr(log->logno, log->meta.insert);
+}
+
+/*
+ * In recovery, we expect the xid to map to a known log which already has
+ * enough space in it.
+ */
+UndoRecPtr
+UndoLogAllocateInRecovery(TransactionId xid, size_t size,
+						  UndoPersistence level)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	/*
+	 * The sequence of calls to UndoLogAllocateInRecovery() during REDO
+	 * (recovery) must match the sequence of calls to UndoLogAllocate() during
+	 * DO, for any given session.  The XXX_redo code for any UNDO-generating
+	 * operation must use UndoLogAllocateInRecovery() rather than
+	 * UndoLogAllocate(), because it must supply the extra 'xid' argument so
+	 * that we can find out which undo log number to use.  During DO, that's
+	 * tracked per-backend, but during REDO the original backends/sessions are
+	 * lost and we have only the Xids.
+	 */
+	Assert(InRecovery);
+
+	/*
+	 * Look up the undo log number for this xid.  The mapping must already
+	 * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the
+	 * first call to UndoLogAllocate for this xid after the most recent
+	 * checkpoint.
+	 */
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	if (logno == InvalidUndoLogNumber)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	/*
+	 * This log must already have been created by an XLOG_UNDOLOG_CREATE
+	 * record emitted by UndoLogAllocate().
+	 */
+	log = get_undo_log_by_number(logno);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/*
+	 * This log must already have been extended to cover the requested size by
+	 * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate(), or by
+	 * XLOG_UNDOLOG_DISCARD records recycling segments.
+	 */
+	if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size))
+		elog(ERROR,
+			 "unexpectedly couldn't allocate %zu bytes in undo log number %d",
+			 size, logno);
+
+	/*
+	 * By now we have allocated undo log space in this transaction, so any
+	 * subsequent record will not be the first undo record of the transaction.
+	 */
+	log->meta.is_first_rec = false;
+
+	return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ *
+ * XXX The original idea was that this step needed to be done separately from
+ * the UndoLogAllocate() call because we were using a slightly different
+ * scheme for interlocking with checkpoints.  The thought was that the zheap
+ * operation allocating undo log space should be WAL logged in between
+ * allocation and advancing.  Now that we are using FPI-style undo log
+ * meta-data records, this probably isn't needed anymore.  We might be able
+ * to lose this function and just advance when we allocate.  To review.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = NULL;
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insertion_point);
+
+	/*
+	 * During recovery, MyUndoLogState.logs[] is not maintained, so we have to
+	 * look the log up by number instead.
+	 */
+	log = (InRecovery) ? get_undo_log_by_number(logno)
+		: MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+	Assert(InRecovery || logno == log->logno);
+	Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions.  The passed in undo pointer is
+ * the address of the oldest data that the called would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_pointer).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment.  This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk.  Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted.  It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+	UndoLogControl *log = get_undo_log_by_number(logno);
+	UndoLogOffset old_discard;
+	UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+	UndoLogOffset end;
+	int		segno;
+	int		new_segno;
+	bool		need_to_flush_wal = false;
+
+	if (log == NULL)
+		elog(ERROR, "cannot advance discard pointer for unknown undo log %d",
+			 logno);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (discard > log->meta.insert)
+		elog(ERROR, "cannot move discard point past insert point");
+	old_discard = log->meta.discard;
+	if (discard < old_discard)
+		elog(ERROR, "cannot move discard pointer backwards");
+	end = log->meta.end;
+	LWLockRelease(&log->mutex);
+
+	/*
+	 * Drop all buffers holding this undo data out of the buffer pool (except
+	 * the last one, if the new location is in the middle of it somewhere), so
+	 * that the contained data doesn't ever touch the disk.  The caller
+	 * promises that this data will not be needed again.  We have to drop the
+	 * buffers from the buffer pool before removing files, otherwise a
+	 * concurrent session might try to write the block to evict the buffer.
+	 */
+	forget_undo_buffers(logno, old_discard, discard, false);
+
+	/*
+	 * Check if we crossed a segment boundary and need to do some synchronous
+	 * filesystem operations.
+	 */
+	segno = old_discard / UndoLogSegmentSize;
+	new_segno = discard / UndoLogSegmentSize;
+	if (segno < new_segno)
+	{
+		int		recycle;
+		UndoLogOffset pointer;
+
+		/*
+		 * We always WAL-log discards, but we only need to flush the WAL if we
+		 * have performed a filesystem operation.
+		 */
+		need_to_flush_wal = true;
+
+		/*
+		 * XXX When we rename or unlink a file, it's possible that some
+		 * backend still has it open because it has recently read a page from
+		 * it.  smgr/undofile.c in any such backend will eventually close it,
+		 * because it considers that fd to belong to the file with the name
+		 * that we're unlinking or renaming and it doesn't like to keep more
+		 * than one open at a time.  No backend should ever try to read from
+		 * such a file descriptor; that is what it means when we say that the
+		 * caller of UndoLogDiscard() asserts that there will be no attempts
+		 * to access the discarded range of undo log.  In the case of a
+		 * rename, if a backend were to attempt to read undo data in the range
+		 * being discarded, it would read entirely the wrong data.
+		 */
+
+		/*
+		 * How many segments should we recycle (= rename from tail position to
+		 * head position)?  For now it's always 1 unless there is already a
+		 * spare one, but we could have an adaptive algorithm that recycles
+		 * multiple segments at a time and pays just one fsync().
+		 */
+		if (log->meta.end - log->meta.insert < UndoLogSegmentSize)
+			recycle = 1;
+		else
+			recycle = 0;
+
+		/* Rewind to the start of the segment. */
+		pointer = segno * UndoLogSegmentSize;
+
+		while (pointer < new_segno * UndoLogSegmentSize)
+		{
+			char	discard_path[MAXPGPATH];
+
+			/*
+			 * Before removing the file, make sure that undofile_sync knows
+			 * that it might be missing.
+			 */
+			undofile_forgetsync(log->logno,
+								log->meta.tablespace,
+								pointer / UndoLogSegmentSize);
+
+			UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize,
+							   log->meta.tablespace, discard_path);
+
+			/* Can we recycle the oldest segment? */
+			if (recycle > 0)
+			{
+				char	recycle_path[MAXPGPATH];
+
+				/*
+				 * End points one byte past the end of the current undo space,
+				 * ie to the first byte of the segment file we want to create.
+				 */
+				UndoLogSegmentPath(logno, end / UndoLogSegmentSize,
+								   log->meta.tablespace, recycle_path);
+				if (rename(discard_path, recycle_path) == 0)
+				{
+					elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+					end += UndoLogSegmentSize;
+					--recycle;
+				}
+				else
+				{
+					elog(ERROR, "could not rename \"%s\" to \"%s\": %m",
+						 discard_path, recycle_path);
+				}
+			}
+			else
+			{
+				if (unlink(discard_path) == 0)
+					elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+				else
+					elog(ERROR, "could not unlink \"%s\": %m", discard_path);
+			}
+			pointer += UndoLogSegmentSize;
+		}
+	}
+
+	/* WAL log the discard. */
+	{
+		xl_undolog_discard xlrec;
+		XLogRecPtr ptr;
+
+		xlrec.logno = logno;
+		xlrec.discard = discard;
+		xlrec.end = end;
+		xlrec.latestxid = xid;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
+
+		if (need_to_flush_wal)
+			XLogFlush(ptr);
+	}
+
+	/* Update shmem to show the new discard and end pointers. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get the tablespace for a given UndoRecPtr.
+ */
+Oid
+UndoRecPtrGetTablespace(UndoRecPtr ptr)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(ptr);
+	UndoLogControl *log = get_undo_log_by_number(logno);
+
+	/*
+	 * No need to acquire log->mutex, because log->meta.tablespace is constant
+	 * for the lifetime of the log.  In future, we might change
+	 * DropUndoLogsInTablespace() so that it discards only up the next segment
+	 * file but then allows the undo log to be reused for another tablespace,
+	 * and then we might need to reconsider this.
+	 */
+	if (log != NULL)
+		return log->meta.tablespace;
+	else
+		return InvalidOid;
+}
+
+/*
+ * Return an UndoRecPtr to the oldest valid data in an undo log, or
+ * InvalidUndoRecPtr if it is empty.
+ */
+UndoRecPtr
+UndoLogGetFirstValidRecord(UndoLogControl *log)
+{
+	UndoRecPtr	result;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (log->meta.discard == log->meta.insert)
+		result = InvalidUndoRecPtr;
+	else
+		result = MakeUndoRecPtr(log->logno, log->meta.discard);
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Return the next insert location.  This also validates the input xid: if
+ * the latest insert point does not belong to the same transaction id, this
+ * returns InvalidUndoRecPtr.
+ */
+UndoRecPtr
+UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log_by_number(logno);
+	TransactionId	logxid;
+	UndoRecPtr	insert;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert);
+}
+
+/*
+ * Rewind the undo log insert position and also set prevlen in the meta-data.
+ */
+void
+UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen)
+{
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insert_urp);
+	UndoLogControl *log = get_undo_log_by_number(logno);
+	UndoLogOffset	insert = UndoRecPtrGetOffset(insert_urp);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = insert;
+	log->meta.prevlen = prevlen;
+
+	/*
+	 * Force WAL-logging of the undo meta-data on the next undo allocation, so
+	 * that during recovery the undo insert location is consistent with normal
+	 * allocation.
+	 */
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	/* WAL log the rewind. */
+	{
+		xl_undolog_rewind xlrec;
+
+		xlrec.logno = logno;
+		xlrec.insert = insert;
+		xlrec.prevlen = prevlen;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND);
+	}
+}
+
+/*
+ * Delete unreachable files under pg_undo.  Any files corresponding to LSN
+ * positions before the previous checkpoint are no longer needed.
+ */
+static void
+CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo)
+{
+	DIR	   *dir;
+	struct dirent *de;
+	char	path[MAXPGPATH];
+	char	oldest_path[MAXPGPATH];
+
+	/*
+	 * If a base backup is in progress, we can't delete any checkpoint
+	 * snapshot files because one of them corresponds to the backup label but
+	 * there could be any number of checkpoints during the backup.
+	 */
+	if (BackupInProgress())
+		return;
+
+	/* Otherwise keep only those >= the previous checkpoint's redo point. */
+	snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	dir = AllocateDir("pg_undo");
+	while ((de = ReadDir(dir, "pg_undo")) != NULL)
+	{
+		/*
+		 * Assume that fixed width uppercase hex strings sort the same way as
+		 * the values they represent, so we can use strcmp to identify undo
+		 * log snapshot files corresponding to checkpoints that we don't need
+		 * anymore.  This assumption holds for ASCII.
+		 */
+		if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH))
+			continue;
+
+		if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path))
+		{
+			snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name);
+			if (unlink(path) != 0)
+				elog(ERROR, "could not unlink file \"%s\": %m", path);
+		}
+	}
+	FreeDir(dir);
+}
+
+/*
+ * Write out the undo log meta data to the pg_undo directory.  The actual
+ * contents of undo logs is in shared buffers and therefore handled by
+ * CheckPointBuffers(), but here we record the table of undo logs and their
+ * properties.
+ */
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogMetaData *serialized = NULL;
+	UndoLogNumber low_logno;
+	UndoLogNumber high_logno;
+	UndoLogNumber logno;
+	size_t	serialized_size = 0;
+	char   *data;
+	char	path[MAXPGPATH];
+	int		num_logs;
+	int		fd;
+	pg_crc32c crc;
+
+	/*
+	 * Take this opportunity to check if we can free up any DSM segments and
+	 * also some entries in the checkpoint file by forgetting about entirely
+	 * discarded undo logs.  Otherwise both would eventually grow large.
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	while (shared->low_logno < shared->high_logno)
+	{
+		UndoLogControl *log;
+
+		log = get_undo_log_by_number_unlocked(shared->low_logno);
+		if (log->meta.status != UNDO_LOG_STATUS_DISCARDED)
+			break;
+
+		/*
+		 * If this was the last slot in a bank, the bank is no longer needed.
+		 * The shared memory will be given back to the operating system once
+		 * every attached backend runs undolog_bank_gc().
+		 */
+		if (UndoLogNoGetSlotNo(shared->low_logno + 1) == 0)
+			shared->banks[UndoLogNoGetBankNo(shared->low_logno)] =
+				DSM_HANDLE_INVALID;
+
+		++shared->low_logno;
+	}
+	LWLockRelease(UndoLogLock);
+
+	/* Detach from any banks that we don't need if low_logno advanced. */
+	undolog_bank_gc();
+
+	/*
+	 * We acquire UndoLogLock to prevent any undo logs from being created or
+	 * discarded while we build a snapshot of them.  This isn't expected to
+	 * take long on a healthy system because the number of active logs should
+	 * be around the number of backends.  Holding this lock won't prevent
+	 * concurrent access to the undo log, except when segments need to be
+	 * added or removed.
+	 */
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+
+	low_logno = shared->low_logno;
+	high_logno = shared->high_logno;
+	num_logs = high_logno - low_logno;
+
+	/*
+	 * Rather than doing the file IO while we hold the lock, we'll copy it
+	 * into a palloc'd buffer.
+	 */
+	if (num_logs > 0)
+	{
+		serialized_size = sizeof(UndoLogMetaData) * num_logs;
+		serialized = (UndoLogMetaData *) palloc0(serialized_size);
+
+		for (logno = low_logno; logno != high_logno; ++logno)
+		{
+			UndoLogControl *log;
+
+			log = get_undo_log_by_number_unlocked(logno);
+			if (log == NULL) /* XXX can this happen? */
+				continue;
+
+			/* Capture snapshot while holding the mutex. */
+			LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+			log->need_attach_wal_record = true;
+			memcpy(&serialized[logno - low_logno], &log->meta, sizeof(UndoLogMetaData));
+			LWLockRelease(&log->mutex);
+		}
+	}
+
+	LWLockRelease(UndoLogLock);
+
+	/* Dump into a file under pg_undo. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", path)));
+
+	/* Compute header checksum. */
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &low_logno, sizeof(low_logno));
+	COMP_CRC32C(crc, &high_logno, sizeof(high_logno));
+	FIN_CRC32C(crc);
+
+	/* Write out range of active log numbers + crc. */
+	if ((write(fd, &low_logno, sizeof(low_logno)) != sizeof(low_logno)) ||
+		(write(fd, &high_logno, sizeof(high_logno)) != sizeof(high_logno)) ||
+		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+	/* Write out the meta data for all undo logs in that range. */
+	data = (char *) serialized;
+	INIT_CRC32C(crc);
+	while (serialized_size > 0)
+	{
+		ssize_t written;
+
+		written = write(fd, data, serialized_size);
+		if (written < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m", path)));
+		COMP_CRC32C(crc, data, written);
+		serialized_size -= written;
+		data += written;
+	}
+	FIN_CRC32C(crc);
+
+	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+
+	/* Flush file and directory entry. */
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
+	pg_fsync(fd);
+	CloseTransientFile(fd);
+	fsync_fname("pg_undo", true);
+	pgstat_report_wait_end();
+
+	if (serialized)
+		pfree(serialized);
+
+	CleanUpUndoCheckPointFiles(priorCheckPointRedo);
+	undolog_xid_map_gc();
+}
+
+void
+StartupUndoLogs(XLogRecPtr checkPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char	path[MAXPGPATH];
+	int		logno;
+	int		fd;
+	pg_crc32c crc;
+	pg_crc32c new_crc;
+
+	/* If initdb is calling, there is no file to read yet. */
+	if (IsBootstrapProcessingMode())
+		return;
+
+	/* Open the pg_undo file corresponding to the given checkpoint. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
+
+	/* Read the active log number range. */
+	if ((read(fd, &shared->low_logno, sizeof(shared->low_logno))
+		 != sizeof(shared->low_logno)) ||
+		(read(fd, &shared->high_logno, sizeof(shared->high_logno))
+		 != sizeof(shared->high_logno)) ||
+		(read(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+
+	/* Verify the header checksum. */
+	INIT_CRC32C(new_crc);
+	COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(new_crc, &shared->high_logno, sizeof(shared->high_logno));
+	FIN_CRC32C(new_crc);
+
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	/* Initialize all the logs and set up the freelist. */
+	INIT_CRC32C(new_crc);
+	for (logno = shared->low_logno; logno < shared->high_logno; ++logno)
+	{
+		UndoLogControl *log;
+
+		/* Get a zero-initialized control object. */
+		LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+		ensure_undo_log_number(logno);
+		LWLockRelease(UndoLogLock);
+
+		log = get_undo_log_by_number(logno);
+
+		/* Read in the meta data for this undo log. */
+		if (read(fd, &log->meta, sizeof(log->meta)) != sizeof(log->meta))
+			elog(ERROR, "corrupted pg_undo meta data in file \"%s\": %m",
+				 path);
+		COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta));
+
+		/*
+		 * At normal start-up, or during recovery, all active undo logs start
+		 * out on the appropriate free list.
+		 */
+		log->pid = InvalidPid;
+		log->xid = InvalidTransactionId;
+		if (log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+		{
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = logno;
+		}
+	}
+	FIN_CRC32C(new_crc);
+
+	/* Verify body checksum. */
+	if (read(fd, &crc, sizeof(crc)) != sizeof(crc))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	CloseTransientFile(fd);
+	pgstat_report_wait_end();
+}
+
+/*
+ * Workhorse for get_undo_log_by_number().  Also callable directly if
+ * UndoLogLock is already held.
+ */
+static UndoLogControl *
+get_undo_log_by_number_unlocked(UndoLogNumber logno)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	int bankno = UndoLogNoGetBankNo(logno);
+	int slotno = UndoLogNoGetSlotNo(logno);
+
+	Assert(LWLockHeldByMe(UndoLogLock));
+
+	/* See if we need to attach to the bank that holds logno. */
+	if (MyUndoLogState.banks[bankno] == NULL)
+	{
+		dsm_segment *segment;
+
+		/* See if we need to map in a new bank. */
+		if (shared->banks[bankno] != DSM_HANDLE_INVALID)
+		{
+			segment = dsm_attach(shared->banks[bankno]);
+			if (segment != NULL)
+			{
+				MyUndoLogState.bank_segments[bankno] = segment;
+				MyUndoLogState.banks[bankno] = dsm_segment_address(segment);
+				dsm_pin_mapping(segment);
+			}
+		}
+
+		/*
+		 * If we didn't manage to find a bank to map in, the undo log we're
+		 * being asked for must be entirely discarded.  In that case we just
+		 * return NULL.
+		 */
+		if (unlikely(MyUndoLogState.banks[bankno] == NULL))
+		{
+			Assert(logno < shared->low_logno);
+			return NULL;
+		}
+	}
+
+	return &MyUndoLogState.banks[bankno][slotno];
+}
+
+/*
+ * Get an UndoLogControl pointer for a given logno.  This may require
+ * attaching to a DSM segment if it isn't already attached in this backend.
+ *
+ * If InvalidUndoLogNumber is passed in, the lowest existing undo log will be
+ * found.  Even if the bank of undo logs is entirely discarded, the returned
+ * pointer remains valid until undolog_bank_gc() is called in the calling
+ * backend.
+ *
+ * Return NULL if there is no such logno because it has been entirely
+ * discarded.  Most callers shouldn't expect to get a NULL, but
+ * UndoLogIsDiscarded() must be prepared to receive an ancient UndoRecPtr.
+ */
+static UndoLogControl *
+get_undo_log_by_number(UndoLogNumber logno)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	int bankno = UndoLogNoGetBankNo(logno);
+	int slotno = UndoLogNoGetSlotNo(logno);
+
+	/* Is it an ancient discarded logno? */
+	if (unlikely(logno != InvalidUndoLogNumber &&
+				 logno < shared->low_logno))
+	{
+		/*
+		 * We opportunistically checked without locking so shared->low_logno
+		 * might be out of date.
+		 */
+		return NULL;
+	}
+
+	/* Is it not currently mapped into this backend? */
+	if (unlikely(logno == InvalidUndoLogNumber ||
+				 MyUndoLogState.banks[bankno] == NULL))
+	{
+		UndoLogControl *log;
+
+		/* We'll have to acquire the lock. */
+		LWLockAcquire(UndoLogLock, LW_SHARED);
+		if (logno == InvalidUndoLogNumber)
+		{
+			if (shared->low_logno == shared->high_logno)
+				log = NULL;
+			else
+				log = get_undo_log_by_number_unlocked(shared->low_logno);
+		}
+		else
+			log = get_undo_log_by_number_unlocked(logno);
+		LWLockRelease(UndoLogLock);
+
+		return log;
+	}
+
+	/* Fast path: it's already mapped in. */
+	return &MyUndoLogState.banks[bankno][slotno];
+}
+
+UndoLogControl *
+UndoLogGet(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log_by_number(logno);
+
+	if (log == NULL)
+		elog(ERROR, "unknown undo log number %d", logno);
+
+	return log;
+}
+
+/*
+ * Initialize every UndoLogControl object in a bank: record its undo log
+ * number and initialize its locks.
+ */
+static void
+initialize_undo_log_bank(int bankno, UndoLogControl *bank)
+{
+	int		i;
+	int		logs_per_bank = 1 << (UndoLogNumberBits - UndoLogBankBits);
+
+	for (i = 0; i < logs_per_bank; ++i)
+	{
+		bank[i].logno = logs_per_bank * bankno + i;
+		LWLockInitialize(&bank[i].mutex, LWTRANCHE_UNDOLOG);
+		LWLockInitialize(&bank[i].discard_lock, LWTRANCHE_UNDODISCARD);
+	}
+}
+
+/*
+ * Create shared memory space for a given undo log number, if it doesn't exist
+ * already.
+ */
+static void
+ensure_undo_log_number(UndoLogNumber logno)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	int		bankno = UndoLogNoGetBankNo(logno);
+
+	/* In single-user mode, we have to use backend-private memory. */
+	if (!IsUnderPostmaster)
+	{
+			if (MyUndoLogState.banks[bankno] == NULL)
+			{
+				size_t size;
+
+				size = sizeof(UndoLogControl) * (1 << UndoLogBankBits);
+				MyUndoLogState.banks[bankno] =
+					MemoryContextAllocZero(TopMemoryContext, size);
+				initialize_undo_log_bank(bankno, MyUndoLogState.banks[bankno]);
+			}
+			return;
+	}
+
+	Assert(LWLockHeldByMeInMode(UndoLogLock, LW_EXCLUSIVE));
+
+	/* Do we need to create a bank in shared memory for this undo log number? */
+	if (shared->banks[bankno] == DSM_HANDLE_INVALID)
+	{
+		dsm_segment *segment;
+		size_t size;
+
+		size = sizeof(UndoLogControl) * (1 << UndoLogBankBits);
+		segment = dsm_create(size, 0);
+		dsm_pin_mapping(segment);
+		dsm_pin_segment(segment);
+		memset(dsm_segment_address(segment), 0, size);
+		shared->banks[bankno] = dsm_segment_handle(segment);
+		MyUndoLogState.banks[bankno] = dsm_segment_address(segment);
+		initialize_undo_log_bank(bankno, MyUndoLogState.banks[bankno]);
+	}
+}
+
+/*
+ * Attach to an undo log, possibly creating or recycling one.
+ */
+static void
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = NULL;
+	UndoLogNumber logno;
+	UndoLogNumber *place;
+
+	Assert(!InRecovery);
+	Assert(MyUndoLogState.logs[persistence] == NULL);
+
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/*
+	 * For now we have a simple linked list of unattached undo logs for each
+	 * persistence level.  We'll grovel through it to find something for the
+	 * tablespace you asked for.  If you're not using multiple tablespaces,
+	 * we'll simply pop one off the front.  We might need a hash table
+	 * keyed by tablespace if this simple scheme turns out to be too slow when
+	 * using many tablespaces and many undo logs, but that seems like an
+	 * unusual use case not worth optimizing for.
+	 */
+	place = &shared->free_lists[persistence];
+	while (*place != InvalidUndoLogNumber)
+	{
+		UndoLogControl *candidate = get_undo_log_by_number_unlocked(*place);
+
+		if (candidate == NULL)
+			elog(ERROR, "corrupted undo log freelist");
+		if (candidate->meta.tablespace == tablespace)
+		{
+			logno = *place;
+			log = candidate;
+			*place = candidate->next_free;
+			break;
+		}
+		place = &candidate->next_free;
+	}
+
+	/*
+	 * All existing undo logs for this tablespace and persistence level are
+	 * busy, so we'll have to create a new one.
+	 */
+	if (log == NULL)
+	{
+		if (shared->high_logno >= (1 << UndoLogNumberBits))
+		{
+			/*
+			 * You've used up all 16 exabytes of undo log addressing space.
+			 * This is a difficult state to reach using only 16 exabytes of
+			 * WAL.
+			 */
+			elog(ERROR, "cannot create new undo log");
+		}
+
+		logno = shared->high_logno;
+		ensure_undo_log_number(logno);
+
+		/* Get new zero-filled UndoLogControl object. */
+		log = get_undo_log_by_number_unlocked(logno);
+
+		Assert(log->meta.persistence == 0);
+		Assert(log->meta.tablespace == InvalidOid);
+		Assert(log->meta.discard == 0);
+		Assert(log->meta.insert == 0);
+		Assert(log->meta.end == 0);
+		Assert(log->pid == 0);
+		Assert(log->xid == 0);
+
+		/*
+		 * The insert and discard pointers start after the first block's
+		 * header.  XXX That means that insert is > end for a short time in a
+		 * newly created undo log.  Is there any problem with that?
+		 */
+		log->meta.insert = UndoLogBlockHeaderSize;
+		log->meta.discard = UndoLogBlockHeaderSize;
+
+		log->meta.tablespace = tablespace;
+		log->meta.persistence = persistence;
+		log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+
+		/* Move the high log number pointer past this one. */
+		++shared->high_logno;
+
+		/* WAL-log the creation of this new undo log. */
+		{
+			xl_undolog_create xlrec;
+
+			xlrec.logno = logno;
+			xlrec.tablespace = log->meta.tablespace;
+			xlrec.persistence = log->meta.persistence;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+		}
+
+		/*
+		 * This undo log has no segments.  UndoLogAllocate will create the
+		 * first one on demand.
+		 */
+	}
+	LWLockRelease(UndoLogLock);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = MyProcPid;
+	log->xid = InvalidTransactionId;
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	MyUndoLogState.logs[persistence] = log;
+}
+
+/*
+ * Free chunks of the xid/undo log map that relate to transactions that are no
+ * longer running.  This is run at each checkpoint.
+ */
+static void
+undolog_xid_map_gc(void)
+{
+	UndoLogNumber **xid_map = MyUndoLogState.xid_map;
+	TransactionId oldest_xid;
+	uint16 new_oldest_chunk;
+	uint16 oldest_chunk;
+
+	if (xid_map == NULL)
+		return;
+
+	/*
+	 * During crash recovery, it may not be possible to call GetOldestXmin()
+	 * yet because latestCompletedXid is invalid.
+	 */
+	if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid))
+		return;
+
+	oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	new_oldest_chunk = UndoLogGetXidHigh(oldest_xid);
+	oldest_chunk = MyUndoLogState.xid_map_oldest_chunk;
+
+	while (oldest_chunk != new_oldest_chunk)
+	{
+		if (xid_map[oldest_chunk])
+		{
+			pfree(xid_map[oldest_chunk]);
+			xid_map[oldest_chunk] = NULL;
+		}
+		oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits);
+	}
+	MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk;
+}
+
+/*
+ * Detach from shared memory banks that are no longer needed because they hold
+ * undo logs that are entirely discarded.  This should ideally be called
+ * periodically in any backend that accesses undo data, so that they have a
+ * chance to detach from DSM segments that hold banks of entirely discarded
+ * undo log control objects.
+ */
+static void
+undolog_bank_gc(void)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogNumber low_logno = shared->low_logno;
+
+	if (unlikely(MyUndoLogState.low_logno < low_logno))
+	{
+		int low_bank = UndoLogNoGetBankNo(low_logno);
+		int bank = UndoLogNoGetBankNo(MyUndoLogState.low_logno);
+
+		while (bank < low_bank)
+		{
+			Assert(shared->banks[bank] == DSM_HANDLE_INVALID);
+			if (MyUndoLogState.banks[bank] != NULL)
+			{
+				dsm_detach(MyUndoLogState.bank_segments[bank]);
+				MyUndoLogState.bank_segments[bank] = NULL;
+				MyUndoLogState.banks[bank] = NULL;
+			}
+			++bank;
+		}
+	}
+
+	MyUndoLogState.low_logno = low_logno;
+}
+
+/*
+ * Associate an xid with an undo log during recovery.  In a primary server,
+ * this isn't necessary because backends know which undo log they're attached
+ * to.  During recovery, the natural association between backends and xids is
+ * lost, so we need to manage that explicitly.
+ */
+static void
+undolog_xid_map_add(TransactionId xid, UndoLogNumber logno)
+{
+	uint16		high_bits;
+	uint16		low_bits;
+
+	high_bits = UndoLogGetXidHigh(xid);
+	low_bits = UndoLogGetXidLow(xid);
+
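+	/*
+	 * The xid map is a two-level array: the xid's high bits select a chunk
+	 * that is allocated on demand, and the low bits index the UndoLogNumber
+	 * slot within that chunk.  Chunks covering xids older than the oldest
+	 * xmin are freed again by undolog_xid_map_gc().
+	 */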
+	if (unlikely(MyUndoLogState.xid_map == NULL))
+	{
+		/* First time through.  Create mapping array. */
+		MyUndoLogState.xid_map =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber *) *
+								   (1 << (32 - UndoLogXidLowBits)));
+		MyUndoLogState.xid_map_oldest_chunk = high_bits;
+	}
+
+	if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL))
+	{
+		/* This chunk of the mapping doesn't exist yet.  Create it. */
+		MyUndoLogState.xid_map[high_bits] =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber) *
+								   (1 << UndoLogXidLowBits));
+	}
+
+	/* Associate this xid with this undo log number. */
+	MyUndoLogState.xid_map[high_bits][low_bits] = logno;
+}
+
+/* check_hook: validate new undo_tablespaces */
+bool
+check_undo_tablespaces(char **newval, void **extra, GucSource source)
+{
+	char	   *rawname;
+	List	   *namelist;
+
+	/* Need a modifiable copy of string */
+	rawname = pstrdup(*newval);
+
+	/*
+	 * Parse string into list of identifiers, just to check for
+	 * well-formedness (unfortunately we can't validate the names in the
+	 * catalog yet).
+	 */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+	{
+		/* syntax error in name list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawname);
+		list_free(namelist);
+		return false;
+	}
+
+	/*
+	 * Make sure we aren't already in a transaction that has been assigned an
+	 * XID.  This ensures we don't detach from an undo log that we might have
+	 * started writing undo data into for this transaction.
+	 */
+	if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+	list_free(namelist);
+
+	return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+	/*
+	 * This is normally called only when GetTopTransactionIdIfAny() ==
+	 * InvalidTransactionId (because you can't change undo_tablespaces in the
+	 * middle of a transaction that's been assigned an xid), but we can't
+	 * assert that because it's also called at the end of a transaction that's
+	 * rolling back, to reset the GUC if it was set inside the transaction.
+	 */
+
+	/* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+	MyUndoLogState.need_to_choose_tablespace = true;
+}
+
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+	char   *rawname;
+	List   *namelist;
+	bool	need_to_unlock;
+	int		length;
+	int		i;
+
+	/* We need a modifiable copy of string. */
+	rawname = pstrdup(undo_tablespaces);
+
+	/* Break string into list of identifiers. */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+		elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+	length = list_length(namelist);
+	if (length == 0 ||
+		(length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+	{
+		/*
+		 * If it's an empty string, then we'll use the default tablespace.  No
+		 * locking is required because it can't be dropped.
+		 */
+		*tablespace = DEFAULTTABLESPACE_OID;
+		need_to_unlock = false;
+	}
+	else
+	{
+		/*
+		 * Choose an OID using our pid, so that if several backends have the
+		 * same multi-tablespace setting they'll spread out.  We could easily
+		 * do better than this if more serious load balancing is judged
+		 * useful.
+		 */
+		int		index = MyProcPid % length;
+		int		first_index = index;
+		Oid		oid = InvalidOid;
+
+		/*
+		 * Take the tablespace create/drop lock while we look the name up.
+		 * This prevents the tablespace from being dropped while we're trying
+		 * to resolve the name, or while the caller is trying to create an
+		 * undo log in it.  The caller will have to release this lock.
+		 */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		for (;;)
+		{
+			const char *name = list_nth(namelist, index);
+
+			oid = get_tablespace_oid(name, true);
+			if (oid == InvalidOid)
+			{
+				/* Unknown tablespace, try the next one. */
+				index = (index + 1) % length;
+				/*
+				 * But if we've tried them all, it's time to complain.  We'll
+				 * arbitrarily complain about the last one we tried in the
+				 * error message.
+				 */
+				if (index == first_index)
+					ereport(ERROR,
+							(errcode(ERRCODE_UNDEFINED_OBJECT),
+							 errmsg("tablespace \"%s\" does not exist", name),
+							 errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+				continue;
+			}
+			if (oid == GLOBALTABLESPACE_OID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("undo logs cannot be placed in pg_global tablespace")));
+			/* If we got here we succeeded in finding one. */
+			break;
+		}
+
+		Assert(oid != InvalidOid);
+		*tablespace = oid;
+		need_to_unlock = true;
+	}
+
+	/*
+	 * If we came here because the user changed undo_tablespaces, then detach
+	 * from any undo logs we happen to be attached to.
+	 */
+	if (force_detach)
+	{
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+		{
+			UndoLogControl *log = MyUndoLogState.logs[i];
+			UndoLogSharedData *shared = MyUndoLogState.shared;
+
+			if (log != NULL)
+			{
+				LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+				log->pid = InvalidPid;
+				log->xid = InvalidTransactionId;
+				LWLockRelease(&log->mutex);
+
+				LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+				log->next_free = shared->free_lists[i];
+				shared->free_lists[i] = log->logno;
+				LWLockRelease(UndoLogLock);
+
+				MyUndoLogState.logs[i] = NULL;
+			}
+		}
+	}
+
+	return need_to_unlock;
+}
+
+bool
+DropUndoLogsInTablespace(Oid tablespace)
+{
+	DIR *dir;
+	char undo_path[MAXPGPATH];
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMe(TablespaceCreateLock));
+	Assert(tablespace != DEFAULTTABLESPACE_OID);
+
+	/* First, try to kick everyone off any undo logs in this tablespace. */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		bool ok;
+		bool return_to_freelist = false;
+
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/* Check if this undo log can be forcibly detached. */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		if (log->meta.discard == log->meta.insert &&
+			(log->xid == InvalidTransactionId ||
+			 !TransactionIdIsInProgress(log->xid)))
+		{
+			log->xid = InvalidTransactionId;
+			if (log->pid != InvalidPid)
+			{
+				log->pid = InvalidPid;
+				return_to_freelist = true;
+			}
+			ok = true;
+		}
+		else
+		{
+			/*
+			 * There is data we need in this undo log.  We can't force it to
+			 * be detached.
+			 */
+			ok = false;
+		}
+		LWLockRelease(&log->mutex);
+
+		/* If we failed, then give up now and report failure. */
+		if (!ok)
+			return false;
+
+		/*
+		 * Put this undo log back on the appropriate free-list.  No one can
+		 * attach to it while we hold TablespaceCreateLock, but if we bail out
+		 * early in a later iteration of this loop, we need the undo log to
+		 * remain usable.  We'll remove all appropriate logs from the
+		 * free-lists in a separate step below.
+		 */
+		if (return_to_freelist)
+		{
+			LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+			LWLockRelease(UndoLogLock);
+		}
+	}
+
+	/*
+	 * We detached all backends from undo logs in this tablespace, and no one
+	 * can attach to any non-default-tablespace undo logs while we hold
+	 * TablespaceCreateLock.  We can now drop the undo logs.
+	 */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/*
+		 * Make sure no buffers remain.  When that is done by UndoDiscard(),
+		 * the final page is left in shared_buffers because it may contain
+		 * data, or at least be needed again very soon.  Here we need to drop
+		 * even that page from the buffer pool.
+		 */
+		forget_undo_buffers(log->logno, log->meta.discard, log->meta.discard, true);
+
+		/*
+		 * TODO: For now we drop the undo log, meaning that it will never be
+		 * used again.  That wastes the rest of its address space.  Instead,
+		 * we should put it onto a special list of 'offline' undo logs, ready
+		 * to be reactivated in some other tablespace.  Then we can keep the
+		 * unused portion of its address space.
+		 */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		log->meta.status = UNDO_LOG_STATUS_DISCARDED;
+		LWLockRelease(&log->mutex);
+	}
+
+	/* Unlink all undo segment files in this tablespace. */
+	UndoLogDirectory(tablespace, undo_path);
+
+	dir = AllocateDir(undo_path);
+	if (dir != NULL)
+	{
+		struct dirent *de;
+
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 ||
+				strcmp(de->d_name, "..") == 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			if (unlink(segment_path) < 0)
+				elog(LOG, "could not unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+	}
+
+	/* Remove all dropped undo logs from the free-lists. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		UndoLogControl *log;
+		UndoLogNumber *place;
+
+		place = &shared->free_lists[i];
+		while (*place != InvalidUndoLogNumber)
+		{
+			log = get_undo_log_by_number(*place);
+			if (log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+				*place = log->next_free;
+			else
+				place = &log->next_free;
+		}
+	}
+	LWLockRelease(UndoLogLock);
+
+	return true;
+}
+
+void
+ResetUndoLogs(UndoPersistence persistence)
+{
+	UndoLogControl *log;
+
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		DIR	   *dir;
+		struct dirent *de;
+		char	undo_path[MAXPGPATH];
+		char	segment_prefix[MAXPGPATH];
+		size_t	segment_prefix_size;
+
+		if (log->meta.persistence != persistence)
+			continue;
+
+		/* Scan the directory for files belonging to this undo log. */
+		snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", log->logno);
+		segment_prefix_size = strlen(segment_prefix);
+		UndoLogDirectory(log->meta.tablespace, undo_path);
+		dir = AllocateDir(undo_path);
+		if (dir == NULL)
+			continue;
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */
+			if (unlink(segment_path) < 0)
+				elog(LOG, "could not unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+
+		/*
+		 * We have no segment files.  Set all pointers to the current end
+		 * pointer, so we'll create the next segment from there as soon as we
+		 * need it.
+		 */
+		log->meta.insert = log->meta.discard = log->meta.end;
+	}
+}
+
+Datum
+pg_stat_get_undo_logs(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_UNDO_LOGS_COLS 8
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	UndoLogNumber low_logno;
+	UndoLogNumber high_logno;
+	UndoLogNumber logno;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char *tablespace_name = NULL;
+	Oid last_tablespace = InvalidOid;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Find the range of active log numbers. */
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+	low_logno = shared->low_logno;
+	high_logno = shared->high_logno;
+	LWLockRelease(UndoLogLock);
+
+	/* Scan all undo logs to build the results. */
+	for (logno = low_logno; logno < high_logno; ++logno)
+	{
+		UndoLogControl *log = get_undo_log_by_number(logno);
+		char buffer[17];
+		Datum values[PG_STAT_GET_UNDO_LOGS_COLS];
+		bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false };
+		Oid tablespace;
+
+		if (log == NULL)
+			continue;
+
+		/*
+		 * This won't be a consistent result overall, but the values for each
+		 * log will be consistent because we'll take the per-log lock while
+		 * copying them.
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+
+		if (log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+		{
+			LWLockRelease(&log->mutex);
+			continue;
+		}
+
+		values[0] = ObjectIdGetDatum((Oid) logno);
+		values[1] = CStringGetTextDatum(
+			log->meta.persistence == UNDO_PERMANENT ? "permanent" :
+			log->meta.persistence == UNDO_UNLOGGED ? "unlogged" :
+			log->meta.persistence == UNDO_TEMP ? "temporary" : "<unknown>");
+		tablespace = log->meta.tablespace;
+
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(logno, log->meta.discard));
+		values[3] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(logno, log->meta.insert));
+		values[4] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(logno, log->meta.end));
+		values[5] = CStringGetTextDatum(buffer);
+		if (log->xid == InvalidTransactionId)
+			nulls[6] = true;
+		else
+			values[6] = TransactionIdGetDatum(log->xid);
+		if (log->pid == InvalidPid)
+			nulls[7] = true;
+		else
+			values[7] = Int32GetDatum((int32) log->pid);
+		LWLockRelease(&log->mutex);
+
+		/*
+		 * Deal with potentially slow tablespace name lookup without the lock.
+		 * Avoid making repeated calls to that expensive function for the
+		 * common case of consecutive logs in the same tablespace.
+		 */
+		if (tablespace != last_tablespace)
+		{
+			if (tablespace_name)
+				pfree(tablespace_name);
+			tablespace_name = get_tablespace_name(tablespace);
+			last_tablespace = tablespace;
+		}
+		if (tablespace_name)
+		{
+			values[2] = CStringGetTextDatum(tablespace_name);
+			nulls[2] = false;
+		}
+		else
+			nulls[2] = true;
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+	if (tablespace_name)
+		pfree(tablespace_name);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * replay the creation of a new undo log
+ */
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+	xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	/* Create meta-data space in shared memory. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	ensure_undo_log_number(xlrec->logno);
+	log = get_undo_log_by_number_unlocked(xlrec->logno);
+	log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+	log->meta.persistence = xlrec->persistence;
+	log->meta.tablespace = xlrec->tablespace;
+	log->meta.insert = UndoLogBlockHeaderSize;
+	log->meta.discard = UndoLogBlockHeaderSize;
+	shared->high_logno = Max(xlrec->logno + 1, shared->high_logno);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * replay the addition of a new segment to an undo log
+ */
+static void
+undolog_xlog_extend(XLogReaderState *record)
+{
+	xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record);
+
+	/* Extend exactly as we would during DO phase. */
+	extend_undo_log(xlrec->logno, xlrec->end);
+}
+
+/*
+ * replay the association of an xid with a specific undo log
+ */
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+	xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+	/*
+	 * Whatever follows is the first record for this transaction.  Zheap will
+	 * use this to add UREC_INFO_TRANSACTION.
+	 */
+	log = get_undo_log_by_number(xlrec->logno);
+	log->meta.is_first_rec = true;
+}
+
+/*
+ * replay undo log meta-data image
+ */
+static void
+undolog_xlog_meta(XLogReaderState *record)
+{
+	xl_undolog_meta *xlrec = (xl_undolog_meta *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+	log = get_undo_log_by_number(xlrec->logno);
+	if (log == NULL)
+		elog(ERROR, "cannot attach to unknown undo log %u", xlrec->logno);
+
+	/*
+	 * Update the insertion point.  While this races against a checkpoint,
+	 * XLOG_UNDOLOG_META always wins because it must be correct for any
+	 * subsequent data appended by this transaction, so we can simply
+	 * overwrite it here.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta = xlrec->meta;
+	log->xid = xlrec->xid;
+	log->pid = MyProcPid; /* show as recovery process */
+	LWLockRelease(&log->mutex);
+
+	MyUndoLogState.logs[log->meta.persistence] = log;
+}
+
+/*
+ * Drop all buffers for the given undo log, from old_discard up to
+ * new_discard.  If drop_tail is true, also drop the buffer that holds
+ * new_discard; this is used when dropping undo logs completely via DROP
+ * TABLESPACE.  If it is false, then the final buffer is not dropped because
+ * it may contain data.
+ */
+static void
+forget_undo_buffers(int logno, UndoLogOffset old_discard,
+					UndoLogOffset new_discard, bool drop_tail)
+{
+	BlockNumber old_blockno;
+	BlockNumber new_blockno;
+	RelFileNode	rnode;
+
+	UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard));
+	old_blockno = old_discard / BLCKSZ;
+	new_blockno = new_discard / BLCKSZ;
+	if (drop_tail)
+		++new_blockno;
+	while (old_blockno < new_blockno)
+		ForgetBuffer(rnode, UndoLogForkNum, old_blockno++);
+}
+
+/*
+ * replay an undo segment discard record
+ */
+static void
+undolog_xlog_discard(XLogReaderState *record)
+{
+	xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	UndoLogOffset old_segment_begin;
+	UndoLogOffset new_segment_begin;
+	RelFileNode rnode = {0};
+	char	dir[MAXPGPATH];
+
+	log = get_undo_log_by_number(xlrec->logno);
+	if (log == NULL)
+		elog(ERROR, "unknown undo log %d", xlrec->logno);
+
+	/*
+	 * We're about to discard undo data.  In Hot Standby mode, ensure that
+	 * there are no queries running that still need to read tuples from the
+	 * undo we are discarding.
+	 *
+	 * XXX we pass an empty rnode to the conflict function so that it checks
+	 * for conflicts in all backends, regardless of which database each
+	 * backend is connected to.
+	 */
+	if (InHotStandby && TransactionIdIsValid(xlrec->latestxid))
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode);
+
+	/*
+	 * See if we need to unlink or rename any files, but don't consider it an
+	 * error if we find that files are missing.  Since UndoLogDiscard()
+	 * performs filesystem operations before WAL logging or updating shmem
+	 * which could be checkpointed, a crash could have left files already
+	 * deleted, but we could replay WAL that expects the files to be there.
+	 */
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	discard = log->meta.discard;
+	end = log->meta.end;
+	LWLockRelease(&log->mutex);
+
+	/* Drop buffers before we remove/recycle any files. */
+	forget_undo_buffers(xlrec->logno, discard, xlrec->discard, false);
+
+	/* Rewind to the start of the segment. */
+	old_segment_begin = discard - discard % UndoLogSegmentSize;
+	new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize;
+
+	/* Unlink or rename segments that are no longer in range. */
+	while (old_segment_begin < new_segment_begin)
+	{
+		char	discard_path[MAXPGPATH];
+
+		/*
+		 * Before removing the file, make sure that undofile_sync knows that
+		 * it might be missing.
+		 */
+		undofile_forgetsync(log->logno,
+							log->meta.tablespace,
+							end / UndoLogSegmentSize);
+
+		UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize,
+						   log->meta.tablespace, discard_path);
+
+		/* Can we recycle the oldest segment? */
+		if (end < xlrec->end)
+		{
+			char	recycle_path[MAXPGPATH];
+
+			UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize,
+							   log->meta.tablespace, recycle_path);
+			if (rename(discard_path, recycle_path) == 0)
+			{
+				elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+				end += UndoLogSegmentSize;
+			}
+			else
+			{
+				elog(LOG, "could not rename \"%s\" to \"%s\": %m",
+					 discard_path, recycle_path);
+			}
+		}
+		else
+		{
+			if (unlink(discard_path) == 0)
+				elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+			else
+				elog(LOG, "could not unlink \"%s\": %m", discard_path);
+		}
+		old_segment_begin += UndoLogSegmentSize;
+	}
+
+	/* Create any further new segments that are needed the slow way. */
+	while (end < xlrec->end)
+	{
+		allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	/* Flush the directory entries. */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/* Update shmem. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = xlrec->discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * replay the rewind of an undo log
+ */
+static void
+undolog_xlog_rewind(XLogReaderState *record)
+{
+	xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	log = get_undo_log_by_number(xlrec->logno);
+	log->meta.insert = xlrec->insert;
+	log->meta.prevlen = xlrec->prevlen;
+}
+
+void
+undolog_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			undolog_xlog_create(record);
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			undolog_xlog_extend(record);
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			undolog_xlog_attach(record);
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			undolog_xlog_discard(record);
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			undolog_xlog_rewind(record);
+			break;
+		case XLOG_UNDOLOG_META:
+			undolog_xlog_meta(record);
+			break;
+		default:
+			elog(PANIC, "undo_redo: unknown op code %u", info);
+	}
+}
+
+/*
+ * For assertions only.
+ */
+bool
+AmAttachedToUndoLog(UndoLogControl *log)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		if (MyUndoLogState.logs[i] == log)
+			return true;
+	}
+	return false;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8cd8bf40ac4..1bf9fd5f36b 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -939,6 +939,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio
     ON pg_subscription TO public;
 
 
+CREATE VIEW pg_stat_undo_logs AS
+    SELECT *
+    FROM pg_stat_get_undo_logs();
+
 --
 -- We have a few function definitions in here, too.
 -- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f7e9160a4f6..29a85289628 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -482,6 +482,20 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	 */
 	LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
 
+	/*
+	 * Drop the undo logs in this tablespace.  This will fail (without
+	 * dropping anything) if there are undo logs that we can't afford to drop
+	 * because they contain non-discarded data or a transaction is in
+	 * progress.  Since we hold TablespaceCreateLock, no other session will be
+	 * able to attach to an undo log in this tablespace (or any tablespace
+	 * except default) concurrently.
+	 */
+	if (!DropUndoLogsInTablespace(tablespaceoid))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
+						tablespacename)));
+
 	/*
 	 * Try to remove the physical infrastructure.
 	 */
@@ -1482,6 +1496,14 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		/* This shouldn't be able to fail in recovery. */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		if (!DropUndoLogsInTablespace(xlrec->ts_id))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("tablespace cannot be dropped because it contains non-empty undo logs")));
+		LWLockRelease(TablespaceCreateLock);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 59c003de9ce..a1e08ef9b39 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -154,6 +154,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_UNDOLOG_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a581c03..4725cbe2d18 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, UndoLogShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	UndoLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81feb6..b6c0b00ed0f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,8 @@ RegisterLWLockTranches(void)
 	LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+	LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log");
+	LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard");
 
 	/* Register named tranches. */
 	for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ecedb3..554af463221 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+UndoLogLock							46
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index ee1444c427f..a46fcec0834 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -116,6 +116,7 @@ extern int	CommitDelay;
 extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
+extern char *undo_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
 
@@ -3341,6 +3342,17 @@ static struct config_string ConfigureNamesString[] =
 		check_temp_tablespaces, assign_temp_tablespaces, NULL
 	},
 
+	{
+		{"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Sets the tablespace(s) to use for undo logs."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE
+		},
+		&undo_tablespaces,
+		"",
+		check_undo_tablespaces, assign_undo_tablespaces, NULL
+	},
+
 	{
 		{"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER,
 			gettext_noop("Sets the path for dynamically loadable modules."),
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ae22e7d9fb8..cb17fbf6fd1 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,11 +208,13 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_undo",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
+	"base/undo",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca4b1c..938150dd915 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
 #include "access/nbtxlog.h"
 #include "access/rmgr.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 0bbe9879ca1..9c6fca46ec8 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 00000000000..d30ce2e297c
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,305 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ *
+ * PostgreSQL undo log manager.  This module is responsible for lifecycle
+ * management of undo logs and backing files, associating undo logs with
+ * backends, allocating and managing space within undo logs.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/xlogreader.h"
+#include "catalog/pg_class.h"
+#include "common/relpath.h"
+#include "storage/bufpage.h"
+
+#ifndef FRONTEND
+#include "storage/lwlock.h"
+#endif
+
+/* The type used to identify an undo log and position within it. */
+typedef uint64 UndoRecPtr;
+
+/* The type used for undo record lengths. */
+typedef uint16 UndoRecordSize;
+
+/* Undo log statuses. */
+typedef enum
+{
+	UNDO_LOG_STATUS_UNUSED = 0,
+	UNDO_LOG_STATUS_ACTIVE,
+	UNDO_LOG_STATUS_EXHAUSTED,
+	UNDO_LOG_STATUS_DISCARDED
+} UndoLogStatus;
+
+/*
+ * Undo log persistence levels.  These have a one-to-one correspondence with
+ * relpersistence values, but are small integers so that we can use them as an
+ * index into the "logs" and "lognos" arrays.
+ */
+typedef enum
+{
+	UNDO_PERMANENT = 0,
+	UNDO_UNLOGGED = 1,
+	UNDO_TEMP = 2
+} UndoPersistence;
+
+#define UndoPersistenceLevels 3
+
+/*
+ * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence
+ * enumerator.
+ */
+#define UndoPersistenceForRelPersistence(rp)						\
+	((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT :			\
+	 (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP)
+
+/*
+ * Convert from UndoPersistence to a relpersistence value.
+ */
+#define RelPersistenceForUndoPersistence(up)				\
+	((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT :	\
+	 (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED :		\
+	 RELPERSISTENCE_TEMP)
+
+/*
+ * Get the appropriate UndoPersistence value from a Relation.
+ */
+#define UndoPersistenceForRelation(rel)									\
+	(UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence))
+
+/* Type for offsets within undo logs */
+typedef uint64 UndoLogOffset;
+
+/* printf-family format string for UndoRecPtr. */
+#define UndoRecPtrFormat "%016" INT64_MODIFIER "X"
+
+/* printf-family format string for UndoLogOffset. */
+#define UndoLogOffsetFormat UINT64_FORMAT
+
+/* Number of blocks of BLCKSZ in an undo log segment file.  128 = 1MB. */
+#define UNDOSEG_SIZE 128
+
+/* Size of an undo log segment file in bytes. */
+#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE)
+
+/* The width of an undo log number in bits.  24 allows for 16.7m logs. */
+#define UndoLogNumberBits 24
+
+/* The width of an undo log offset in bits.  40 allows for 1TB per log.*/
+#define UndoLogOffsetBits (64 - UndoLogNumberBits)
+
+/* Special value for undo record pointer which indicates that it is invalid. */
+#define	InvalidUndoRecPtr	((UndoRecPtr) 0)
+
+/*
+ * This special undo record pointer value is used in the transaction header to
+ * indicate that we don't yet know the start point of the next transaction;
+ * it will be updated with a valid value later.
+ */
+#define SpecialUndoRecPtr	((UndoRecPtr) 0xFFFFFFFFFFFFFFFF)
+
+/*
+ * The maximum amount of data that can be stored in an undo log.  Can be set
+ * artificially low to test full log behavior.
+ */
+#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits)
+
+/* Type for numbering undo logs. */
+typedef int UndoLogNumber;
+
+/* Extract the undo log number from an UndoRecPtr. */
+#define UndoRecPtrGetLogNo(urp)					\
+	((urp) >> UndoLogOffsetBits)
+
+/* Extract the offset from an UndoRecPtr. */
+#define UndoRecPtrGetOffset(urp)				\
+	((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1))
+
+/* Make an UndoRecPtr from a log number and offset. */
+#define MakeUndoRecPtr(logno, offset)			\
+	(((uint64) (logno) << UndoLogOffsetBits) | (offset))
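+
+/*
+ * For example, with the 24/40-bit split defined above, MakeUndoRecPtr(3,
+ * 0x1000) yields 0x0000030000001000: UndoRecPtrGetLogNo() recovers 3 from the
+ * top 24 bits and UndoRecPtrGetOffset() recovers 0x1000 from the low 40 bits.
+ */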
+
+/* The number of unusable bytes in the header of each block. */
+#define UndoLogBlockHeaderSize SizeOfPageHeaderData
+
+/* The number of usable bytes we can store per block. */
+#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize)
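+
+/*
+ * For example, with the default BLCKSZ of 8192 and a 24-byte page header,
+ * each undo page can hold 8168 bytes of undo data.
+ */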
+
+/* The pseudo-database OID used for undo logs. */
+#define UndoLogDatabaseOid 9
+
+/* Length of undo checkpoint filename */
+#define UNDO_CHECKPOINT_FILENAME_LENGTH	16
+
+/*
+ * UndoRecPtrIsValid
+ *		True iff undoRecPtr is valid.
+ */
+#define UndoRecPtrIsValid(undoRecPtr) \
+	((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr))
+
+/* Extract the relnode for an undo log. */
+#define UndoRecPtrGetRelNode(urp)				\
+	UndoRecPtrGetLogNo(urp)
+
+/* The only valid fork number for undo log buffers. */
+#define UndoLogForkNum MAIN_FORKNUM
+
+/* Compute the block number that holds a given UndoRecPtr. */
+#define UndoRecPtrGetBlockNum(urp)				\
+	(UndoRecPtrGetOffset(urp) / BLCKSZ)
+
+/* Compute the offset of a given UndoRecPtr in the page that holds it. */
+#define UndoRecPtrGetPageOffset(urp)			\
+	(UndoRecPtrGetOffset(urp) % BLCKSZ)
+
+/* Compare two undo checkpoint files to find the oldest file. */
+#define UndoCheckPointFilenamePrecedes(file1, file2)	\
+	(strcmp(file1, file2) < 0)
+
+/* Find out which tablespace the given undo log location is backed by. */
+extern Oid UndoRecPtrGetTablespace(UndoRecPtr insertion_point);
+
+/* Populate a RelFileNode from an UndoRecPtr. */
+#define UndoRecPtrAssignRelFileNode(rfn, urp)			\
+	do													\
+	{													\
+		(rfn).spcNode = UndoRecPtrGetTablespace(urp);	\
+		(rfn).dbNode = UndoLogDatabaseOid;				\
+		(rfn).relNode = UndoRecPtrGetRelNode(urp);		\
+	} while (false)
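+
+/*
+ * For example, the buffer holding a given UndoRecPtr can be identified by
+ * combining the macros above: the RelFileNode comes from
+ * UndoRecPtrAssignRelFileNode() (dbNode is always UndoLogDatabaseOid and
+ * relNode is the log number), the fork is always UndoLogForkNum, the block
+ * within that fork is UndoRecPtrGetBlockNum(), and the data starts at
+ * UndoRecPtrGetPageOffset() within the page.
+ */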
+
+/*
+ * Control metadata for an active undo log.  Lives in shared memory inside an
+ * UndoLogControl object, but also written to disk during checkpoints.
+ */
+typedef struct UndoLogMetaData
+{
+	UndoLogStatus status;
+	Oid		tablespace;
+	UndoPersistence persistence;	/* permanent, unlogged, temp? */
+	UndoLogOffset insert;			/* next insertion point (head) */
+	UndoLogOffset end;				/* one past end of highest segment */
+	UndoLogOffset discard;			/* oldest data needed (tail) */
+	UndoLogOffset last_xact_start;	/* last transaction's undo start offset */
+	bool	is_first_rec;
+
+	/*
+	 * Length of the most recently inserted undo record.  We save this in the
+	 * undo meta-data and WAL-log it so that the value survives a restart and
+	 * the first undo record written afterwards can find it.  It is used to
+	 * step back to the previous record of a transaction during rollback: if a
+	 * transaction wrote some undo before a checkpoint and some after it, we
+	 * could not roll back properly without the pre-checkpoint prevlen.  The
+	 * undo worker also fetches this value when rolling back the last
+	 * transaction in the undo log, to locate that transaction's final undo
+	 * record.
+	 */
+	uint16	prevlen;
+} UndoLogMetaData;
+
+/* Record the undo log number used for a transaction. */
+typedef struct xl_undolog_meta
+{
+	UndoLogMetaData	meta;
+	UndoLogNumber	logno;
+	TransactionId	xid;
+} xl_undolog_meta;
+
+#ifndef FRONTEND
+
+/*
+ * The in-memory control object for an undo log.  As well as the current
+ * meta-data for the undo log, we also lazily maintain a snapshot of the
+ * meta-data as it was at the redo point of a checkpoint that is in progress.
+ *
+ * Conceptually the set of UndoLogControl objects is arranged into a very
+ * large array for access by log number, but because we typically need only a
+ * smallish number of adjacent undo logs to be active at a time we arrange
+ * them into smaller fragments called 'banks'.
+ */
+typedef struct UndoLogControl
+{
+	UndoLogNumber logno;
+	UndoLogMetaData meta;			/* current meta-data */
+	XLogRecPtr      lsn;
+	bool	need_attach_wal_record;	/* need to write attach WAL record? */
+	pid_t		pid;				/* InvalidPid for unattached */
+	LWLock	mutex;					/* protects the above */
+	TransactionId xid;
+	/* State used by undo workers. */
+	TransactionId	oldest_xid;		/* cache of oldest transaction's xid */
+	uint32		oldest_xidepoch;
+	UndoRecPtr	oldest_data;
+	LWLock		discard_lock;		/* prevents discarding while reading */
+
+	UndoLogNumber next_free;		/* protected by UndoLogLock */
+} UndoLogControl;
+
+#endif
+
+/* Space management. */
+extern UndoRecPtr UndoLogAllocate(size_t size,
+								  UndoPersistence level,
+								  xl_undolog_meta *undometa);
+extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid,
+											size_t size,
+											UndoPersistence persistence);
+extern void UndoLogAdvance(UndoRecPtr insertion_point,
+						   size_t size,
+						   UndoPersistence persistence);
+extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid);
+extern bool UndoLogIsDiscarded(UndoRecPtr point);
+
+/* Initialization interfaces. */
+extern void StartupUndoLogs(XLogRecPtr checkPointRedo);
+extern void UndoLogShmemInit(void);
+extern Size UndoLogShmemSize(void);
+extern void UndoLogInit(void);
+extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace,
+							   char *path);
+extern void ResetUndoLogs(UndoPersistence persistence);
+
+/* Interface used by tablespace.c. */
+extern bool DropUndoLogsInTablespace(Oid tablespace);
+
+/* GUC interfaces. */
+extern void assign_undo_tablespaces(const char *newval, void *extra);
+
+/* Checkpointing interfaces. */
+extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo,
+							   XLogRecPtr priorCheckPointRedo);
+
+#ifndef FRONTEND
+
+extern UndoLogControl *UndoLogGet(UndoLogNumber logno);
+extern UndoLogControl *UndoLogNext(UndoLogControl *log);
+extern bool AmAttachedToUndoLog(UndoLogControl *log);
+extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogControl *log);
+
+#endif
+
+extern void UndoLogSetLastXactStartPoint(UndoRecPtr point);
+extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno);
+extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno,
+										  TransactionId xid);
+extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen);
+extern bool IsTransactionFirstRec(TransactionId xid);
+extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen);
+extern uint16 UndoLogGetPrevLen(UndoLogNumber logno);
+extern void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);
+
+#endif
diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h
new file mode 100644
index 00000000000..8507db644d6
--- /dev/null
+++ b/src/include/access/undolog_xlog.h
@@ -0,0 +1,70 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog_xlog.h
+ *	  undo log access XLOG definitions.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_XLOG_H
+#define UNDOLOG_XLOG_H
+
+#include "access/undolog.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* XLOG records */
+#define XLOG_UNDOLOG_CREATE		0x00
+#define XLOG_UNDOLOG_EXTEND		0x10
+#define XLOG_UNDOLOG_ATTACH		0x20
+#define XLOG_UNDOLOG_DISCARD	0x30
+#define XLOG_UNDOLOG_REWIND		0x40
+#define XLOG_UNDOLOG_META		0x50
+
+/* Create a new undo log. */
+typedef struct xl_undolog_create
+{
+	UndoLogNumber logno;
+	Oid		tablespace;
+	UndoPersistence persistence;
+} xl_undolog_create;
+
+/* Extend an undo log by adding a new segment. */
+typedef struct xl_undolog_extend
+{
+	UndoLogNumber logno;
+	UndoLogOffset end;
+} xl_undolog_extend;
+
+/* Record the undo log number used for a transaction. */
+typedef struct xl_undolog_attach
+{
+	TransactionId xid;
+	UndoLogNumber logno;
+} xl_undolog_attach;
+
+/* Discard space, and possibly destroy or recycle undo log segments. */
+typedef struct xl_undolog_discard
+{
+	UndoLogNumber logno;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	TransactionId latestxid;	/* latest xid whose undo records are discarded */
+} xl_undolog_discard;
+
+/* Rewind insert location of the undo log. */
+typedef struct xl_undolog_rewind
+{
+	UndoLogNumber logno;
+	UndoLogOffset insert;
+	uint16		  prevlen;
+} xl_undolog_rewind;
+
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 66c6c224a8b..e6b190bc1ea 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10202,4 +10202,11 @@
   proisstrict => 'f', prorettype => 'bool', proargtypes => 'oid int4 int4 any',
   proargmodes => '{i,i,i,v}', prosrc => 'satisfies_hash_partition' },
 
+# undo logs
+{ oid => '5030', descr => 'list undo logs',
+  proname => 'pg_stat_get_undo_logs', procost => '1', prorows => '10', proretset => 't',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,text,text,text,text,text,xid,int4}', proargmodes => '{o,o,o,o,o,o,o,o}',
+  proargnames => '{log_number,persistence,tablespace,discard,insert,end,xid,pid}', prosrc => 'pg_stat_get_undo_logs' },
+
 ]
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c21bfe2f666..05690727ed8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,8 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SHARED_TUPLESTORE,
 	LWTRANCHE_TBM,
 	LWTRANCHE_PARALLEL_APPEND,
+	LWTRANCHE_UNDOLOG,
+	LWTRANCHE_UNDODISCARD,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 3d13a33b94e..6de7a9a8f10 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -426,6 +426,8 @@ extern void GUC_check_errcode(int sqlerrcode);
 extern bool check_default_tablespace(char **newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra, GucSource source);
 extern void assign_temp_tablespaces(const char *newval, void *extra);
+extern bool check_undo_tablespaces(char **newval, void **extra, GucSource source);
+extern void assign_undo_tablespaces(const char *newval, void *extra);
 
 /* in catalog/namespace.c */
 extern bool check_search_path(char **newval, void **extra, GucSource source);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index ae0cd253d5f..dfbd88835e8 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1918,6 +1918,15 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_undo_logs| SELECT pg_stat_get_undo_logs.log_number,
+    pg_stat_get_undo_logs.persistence,
+    pg_stat_get_undo_logs.tablespace,
+    pg_stat_get_undo_logs.discard,
+    pg_stat_get_undo_logs.insert,
+    pg_stat_get_undo_logs."end",
+    pg_stat_get_undo_logs.xid,
+    pg_stat_get_undo_logs.pid
+   FROM pg_stat_get_undo_logs() pg_stat_get_undo_logs(log_number, persistence, tablespace, discard, insert, "end", xid, pid);
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
-- 
2.17.0

0002-Provide-access-to-undo-log-data-via-the-buffer-ma-v1.patchapplication/octet-stream; name=0002-Provide-access-to-undo-log-data-via-the-buffer-ma-v1.patchDownload
From 24129934c491179edcdd02c6085d74074ecb702c Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 25 May 2018 09:43:16 +1200
Subject: [PATCH 2/6] Provide access to undo log data via the buffer manager.

In ancient Berkeley POSTGRES, smgr.c allowed for different storage engines, of
which only md.c survives.  Revive this mechanism to provide access to undo log
data through the existing buffer manager.

Undo logs exist in a pseudo-database whose OID is used to dispatch IO requests
to undofile.c instead of md.c.

Convert RememberFsyncRequest() into a first-class smgr function
smgrrequestsync() so that it can be dispatched to md.c or undofile.c as
appropriate.

XXX Status: WIP.  Some details around fsync queuing are likely to change.

Author: Thomas Munro, though ForgetBuffer() was contributed by Robert Haas
Reviewed-By:
Discussion:
---
 src/backend/access/transam/xlogutils.c |   8 +-
 src/backend/postmaster/checkpointer.c  |   2 +-
 src/backend/postmaster/pgstat.c        |  22 +
 src/backend/storage/buffer/bufmgr.c    |  80 +++-
 src/backend/storage/smgr/Makefile      |   2 +-
 src/backend/storage/smgr/md.c          |  15 +-
 src/backend/storage/smgr/smgr.c        |  37 +-
 src/backend/storage/smgr/undofile.c    | 547 +++++++++++++++++++++++++
 src/include/pgstat.h                   |   7 +
 src/include/storage/bufmgr.h           |  14 +-
 src/include/storage/smgr.h             |  35 +-
 src/include/storage/undofile.h         |  50 +++
 12 files changed, 788 insertions(+), 31 deletions(-)
 create mode 100644 src/backend/storage/smgr/undofile.c
 create mode 100644 src/include/storage/undofile.h

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 52fe55e2afb..4888242604b 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 0950ada6019..c0bde4d0566 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1342,7 +1342,7 @@ AbsorbFsyncRequests(void)
 	LWLockRelease(CheckpointerCommLock);
 
 	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+		smgrrequestsync(request->rnode, request->forknum, request->segno);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 084573e77c0..0c5f3daa69d 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3898,6 +3898,28 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_READ:
+			event_name = "UndoCheckpointRead";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_WRITE:
+			event_name = "UndoCheckpointWrite";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_SYNC:
+			event_name = "UndoCheckpointSync";
+			break;
+		case WAIT_EVENT_UNDO_FILE_READ:
+			event_name = "UndoFileRead";
+			break;
+		case WAIT_EVENT_UNDO_FILE_WRITE:
+			event_name = "UndoFileWrite";
+			break;
+		case WAIT_EVENT_UNDO_FILE_FLUSH:
+			event_name = "UndoFileFlush";
+			break;
+		case WAIT_EVENT_UNDO_FILE_SYNC:
+			event_name = "UndoFileSync";
+			break;
+
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe57063..31a9b54080e 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
 static inline int32 GetPrivateRefCount(Buffer buffer);
 static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void InvalidateBuffer(BufferDesc *buf);
 
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * valid, the page is zeroed instead of throwing an error. This is intended
  * for non-critical data, where the caller is prepared to repair errors.
  *
- * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
+ * In RBM_ZERO mode, if the page isn't in buffer cache already, it's
  * filled with zeros instead of reading it from disk.  Useful when the caller
  * is going to fill the page from scratch, since this saves I/O and avoids
  * unnecessary failure if the page-on-disk has corrupt page headers.
+ *
+ * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked.
  * The page is returned locked to ensure that the caller has a chance to
  * initialize the page before it's made visible to others.
  * Caution: do not use this mode to read a page that is beyond the relation's
@@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy,
+						  char relpersistence)
 {
 	bool		hit;
 
-	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
-
-	Assert(InRecovery);
+	SMgrRelation smgr = smgropen(rnode,
+								 relpersistence == RELPERSISTENCE_TEMP
+								 ? MyBackendId : InvalidBackendId);
 
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -877,7 +876,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Read in the page, unless the caller intends to overwrite it and
 		 * just wants us to allocate a buffer.
 		 */
-		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+		if (mode == RBM_ZERO ||
+			mode == RBM_ZERO_AND_LOCK ||
+			mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
@@ -1331,6 +1332,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 	return buf;
 }
 
+/*
+ * ForgetBuffer -- drop a buffer from shared buffers
+ *
+ * If the buffer isn't present in shared buffers, nothing happens.  If it is
+ * present, it is discarded without making any attempt to write it back out to
+ * the operating system.  The caller must therefore somehow be sure that the
+ * data won't be needed for anything now or in the future.  It assumes that
+ * there is no concurrent access to the block, except that it might be
+ * written out concurrently.
+ */
+void
+ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+{
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+	BufferTag	tag;			/* identity of target block */
+	uint32		hash;			/* hash value for tag */
+	LWLock	   *partitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	hash = BufTableHashCode(&tag);
+	partitionLock = BufMappingPartitionLock(hash);
+
+	/* see if the block is in the buffer pool */
+	LWLockAcquire(partitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&tag, hash);
+	LWLockRelease(partitionLock);
+
+	/* didn't find it, so nothing to do */
+	if (buf_id < 0)
+		return;
+
+	/* take the buffer header lock */
+	bufHdr = GetBufferDescriptor(buf_id);
+	buf_state = LockBufHdr(bufHdr);
+
+	/*
+	 * The buffer might have been evicted after we released the partition lock
+	 * and before we acquired the buffer header lock.  If so, the buffer we've
+	 * locked might contain some other data which we shouldn't touch.  If the
+	 * buffer hasn't been recycled, we proceed to invalidate it.
+	 */
+	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		bufHdr->tag.blockNum == blockNum &&
+		bufHdr->tag.forkNum == forkNum)
+		InvalidateBuffer(bufHdr);		/* releases spinlock */
+	else
+		UnlockBufHdr(bufHdr, buf_state);
+}
+
 /*
  * InvalidateBuffer -- mark a shared buffer invalid and return it to the
  * freelist.
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0df16..b657eb275fa 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrtype.o undofile.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 2ec103e6047..930a2a8c74b 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -44,7 +44,7 @@
 #define UNLINKS_PER_ABSORB		10
 
 /*
- * Special values for the segno arg to RememberFsyncRequest.
+ * Special values for the segno arg to mdrequestsync.
  *
  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
@@ -1433,7 +1433,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
 	}
 	else
 	{
@@ -1469,8 +1469,7 @@ register_unlink(RelFileNodeBackend rnode)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+		mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST);
 	}
 	else
 	{
@@ -1489,7 +1488,7 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ * mdrequestsync() -- callback from checkpointer side of fsync request
  *
  * We stuff fsync requests into the local hash table for execution
  * during the checkpointer's next checkpoint.  UNLINK requests go into a
@@ -1510,7 +1509,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
 {
 	Assert(pendingOpsTable);
 
@@ -1653,7 +1652,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
@@ -1692,7 +1691,7 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 08f06bade25..c456ea1a12b 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,6 +58,8 @@ typedef struct f_smgr
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
+	void		(*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum,
+									 int segno);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);	/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
@@ -69,12 +71,30 @@ static const f_smgr smgrsw[] = {
 	/* magnetic disk */
 	{mdinit, NULL, mdclose, mdcreate, mdexists, mdunlink, mdextend,
 		mdprefetch, mdread, mdwrite, mdwriteback, mdnblocks, mdtruncate,
+		mdrequestsync,
 		mdimmedsync, mdpreckpt, mdsync, mdpostckpt
+	},
+	/* undo logs */
+	{undofile_init, undofile_shutdown, undofile_close, undofile_create,
+	 undofile_exists, undofile_unlink, undofile_extend, undofile_prefetch,
+	 undofile_read, undofile_write, undofile_writeback, undofile_nblocks,
+	 undofile_truncate,
+	 undofile_requestsync,
+	 undofile_immedsync, undofile_preckpt, undofile_sync,
+	 undofile_postckpt
 	}
 };
 
 static const int NSmgr = lengthof(smgrsw);
 
+/*
+ * In ancient Postgres the catalog entry for each relation controlled the
+ * choice of storage manager implementation.  Now we have only md.c for
+ * regular relations, and undofile.c for undo log storage in the undolog
+ * pseudo-database.
+ */
+#define SmgrWhichForRelFileNode(rfn)			\
+	((rfn).dbNode == 9 ? 1 : 0)
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -170,11 +190,18 @@ smgropen(RelFileNode rnode, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		reln->smgr_fsm_nblocks = InvalidBlockNumber;
 		reln->smgr_vm_nblocks = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		/* Which storage manager implementation? */
+		reln->smgr_which = SmgrWhichForRelFileNode(rnode);
 
 		/* mark it not open */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
 			reln->md_num_open_segs[forknum] = 0;
+			reln->md_seg_fds[forknum] = NULL;
+		}
+
+		reln->private_data = NULL;
 
 		/* it has no owner yet */
 		add_to_unowned_list(reln);
@@ -707,6 +734,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 	smgrsw[reln->smgr_which].smgr_truncate(reln, forknum, nblocks);
 }
 
+/*
+ *	smgrrequestsync() -- Enqueue a request for smgrsync() to flush data.
+ */
+void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno);
+}
+
 /*
  *	smgrimmedsync() -- Force the specified relation to stable storage.
  *
diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
new file mode 100644
index 00000000000..dbf98acb227
--- /dev/null
+++ b/src/backend/storage/smgr/undofile.c
@@ -0,0 +1,547 @@
+/*
+ * undofile.c
+ *
+ * PostgreSQL undo file manager.  This module provides an SMGR-compatible
+ * interface to the files that back undo logs on the filesystem, so that undo
+ * log data can use the shared buffer pool.  Other aspects of undo log
+ * management are provided by undolog.c, so the SMGR interfaces not directly
+ * concerned with reading, writing and flushing data are unimplemented.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/storage/smgr/undofile.c
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/fd.h"
+#include "storage/undofile.h"
+#include "utils/memutils.h"
+
+/* intervals for calling AbsorbFsyncRequests in undofile_sync */
+#define FSYNCS_PER_ABSORB		10
+
+/*
+ * Special values for the fork arg to undofile_requestsync.
+ */
+#define FORGET_UNDO_SEGMENT_FSYNC	(InvalidBlockNumber)
+
+/*
+ * While md.c expects random access and has a small number of huge
+ * segments, undofile.c manages a potentially very large number of smaller
+ * segments and has a less random access pattern.  Therefore, instead of
+ * keeping a potentially huge array of vfds we'll just keep the most
+ * recently accessed N.
+ *
+ * For now, N == 1, so we just need to hold onto one 'File' handle.
+ */
+typedef struct UndoFileState
+{
+	int		mru_segno;
+	File	mru_file;
+} UndoFileState;
+
+static MemoryContext UndoFileCxt;
+
+typedef uint16 CycleCtr;
+
+/*
+ * An entry recording the segments that need to be fsynced by undofile_sync().
+ * This is a bit simpler than md.c's version, though it could perhaps be
+ * merged into a common struct.  One difference is that we can have much
+ * larger segment numbers, so we'll adjust for that to avoid having a lot of
+ * leading zero bits.
+ */
+typedef struct
+{
+	RelFileNode rnode;
+	Bitmapset  *requests;
+	CycleCtr	cycle_ctr;
+} PendingOperationEntry;
+
+static HTAB *pendingOpsTable = NULL;
+static MemoryContext pendingOpsCxt;
+
+static CycleCtr undofile_sync_cycle_ctr = 0;
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok);
+static File undofile_get_segment_file(SMgrRelation reln, int segno);
+
+void
+undofile_init(void)
+{
+	UndoFileCxt = AllocSetContextCreate(TopMemoryContext,
+										"UndoFileSmgr",
+										ALLOCSET_DEFAULT_SIZES);
+
+	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	{
+		HASHCTL		hash_ctl;
+
+		pendingOpsCxt = AllocSetContextCreate(UndoFileCxt,
+											  "Pending ops context",
+											  ALLOCSET_DEFAULT_SIZES);
+		MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+		hash_ctl.keysize = sizeof(RelFileNode);
+		hash_ctl.entrysize = sizeof(PendingOperationEntry);
+		hash_ctl.hcxt = pendingOpsCxt;
+		pendingOpsTable = hash_create("Pending Ops Table",
+									  100L,
+									  &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+}
+
+void
+undofile_shutdown(void)
+{
+}
+
+void
+undofile_close(SMgrRelation reln, ForkNumber forknum)
+{
+}
+
+void
+undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_create is not supported");
+}
+
+bool
+undofile_exists(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_exists is not supported");
+}
+
+void
+undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_unlink is not supported");
+}
+
+void
+undofile_extend(SMgrRelation reln, ForkNumber forknum,
+				BlockNumber blocknum, char *buffer,
+				bool skipFsync)
+{
+	elog(ERROR, "undofile_extend is not supported");
+}
+
+void
+undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	elog(ERROR, "undofile_prefetch is not supported");
+}
+
+void
+undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			  char *buffer)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	if (FileSeek(file, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(file))));
+	nbytes = FileRead(file, buffer, BLCKSZ, WAIT_EVENT_UNDO_FILE_READ);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+}
+
+static void
+register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file)
+{
+	/* Temp relations should never be fsync'd */
+	Assert(!SmgrIsTemp(reln));
+
+	if (pendingOpsTable)
+	{
+		/* push it into local pending-ops table */
+		undofile_requestsync(reln->smgr_rnode.node, forknum, segno);
+	}
+	else
+	{
+		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno))
+			return;				/* passed it off successfully */
+
+		ereport(DEBUG1,
+				(errmsg("could not forward fsync request because request queue is full")));
+
+		if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(file))));
+	}
+}
+
+void
+undofile_write(SMgrRelation reln, ForkNumber forknum,
+			   BlockNumber blocknum, char *buffer,
+			   bool skipFsync)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	if (FileSeek(file, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(file))));
+	nbytes = FileWrite(file, buffer, BLCKSZ, WAIT_EVENT_UNDO_FILE_WRITE);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		/*
+		 * short write: unexpected, because this should be overwriting an
+		 * entirely pre-allocated segment file
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+
+	if (!skipFsync && !SmgrIsTemp(reln))
+		register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file);
+}
+
+void
+undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+				   BlockNumber blocknum, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		File	file;
+		int		nflush;
+
+		file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+
+		/* compute number of desired writes within the current segment */
+		nflush = Min(nblocks,
+					 1 + UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE));
+
+		FileWriteback(file,
+					  (blocknum % UNDOSEG_SIZE) * BLCKSZ,
+					  nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH);
+
+		nblocks -= nflush;
+		blocknum += nflush;
+	}
+}
+
+BlockNumber
+undofile_nblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_nblocks is not supported");
+	return 0;
+}
+
+void
+undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
+{
+	elog(ERROR, "undofile_truncate is not supported");
+}
+
+void
+undofile_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_immedsync is not supported");
+}
+
+void
+undofile_preckpt(void)
+{
+}
+
+void
+undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+	PendingOperationEntry *entry;
+	bool		found;
+
+	Assert(pendingOpsTable);
+
+	if (forknum == FORGET_UNDO_SEGMENT_FSYNC)
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_FIND,
+													  NULL);
+		if (entry)
+			entry->requests = bms_del_member(entry->requests, segno);
+	}
+	else
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_ENTER,
+													  &found);
+		if (!found)
+		{
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+			entry->requests = bms_make_singleton(segno);
+		}
+		else
+			entry->requests = bms_add_member(entry->requests, segno);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+void
+undofile_forgetsync(Oid logno, Oid tablespace, int segno)
+{
+	RelFileNode rnode;
+
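+	/*
+	 * Undo log data lives in a pseudo-database; this patch set hard-codes
+	 * OID 9 for it (see SmgrWhichForRelFileNode() in smgr.c).
+	 */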
+	rnode.dbNode = 9;
+	rnode.spcNode = tablespace;
+	rnode.relNode = logno;
+
+	if (pendingOpsTable)
+		undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno);
+	else if (IsUnderPostmaster)
+	{
+		while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno))
+			pg_usleep(10000L);
+	}
+}
+
+void
+undofile_sync(void)
+{
+	static bool undofile_sync_in_progress = false;
+
+	HASH_SEQ_STATUS hstat;
+	PendingOperationEntry *entry;
+	int			absorb_counter;
+	int			segno;
+
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	AbsorbFsyncRequests();
+
+	if (undofile_sync_in_progress)
+	{
+		/* prior try failed, so update any stale cycle_ctr values */
+		hash_seq_init(&hstat, pendingOpsTable);
+		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+	}
+
+	undofile_sync_cycle_ctr++;
+	undofile_sync_in_progress = true;
+
+	absorb_counter = FSYNCS_PER_ABSORB;
+	hash_seq_init(&hstat, pendingOpsTable);
+	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+	{
+		Bitmapset	   *requests;
+
+		/* Skip entries that arrived after we arrived. */
+		if (entry->cycle_ctr == undofile_sync_cycle_ctr)
+			continue;
+
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr);
+
+		if (!enableFsync)
+			continue;
+
+		requests = entry->requests;
+		entry->requests = NULL;
+
+		segno = -1;
+		while ((segno = bms_next_member(requests, segno)) >= 0)
+		{
+			File		file;
+
+			if (!enableFsync)
+				continue;
+
+			file = undofile_open_segment_file(entry->rnode.relNode,
+											  entry->rnode.spcNode,
+											  segno, true /* missing_ok */);
+
+			/*
+			 * The file may be gone due to concurrent discard.  We'll ignore
+			 * that, but only if we find a cancel request for this segment in
+			 * the queue.
+			 *
+			 * It's also possible that we succeed in opening a segment file
+			 * that is subsequently recycled (renamed to represent a new range
+			 * of undo log), in which case we'll fsync that later file
+			 * instead.  That is rare and harmless.
+			 */
+			if (file <= 0)
+			{
+				/*
+				 * Put the request back into the bitset in a way that can't
+				 * fail due to memory allocation.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				/*
+				 * Check if a forgetsync request has arrived to delete that
+				 * segment.
+				 */
+				AbsorbFsyncRequests();
+				if (bms_is_member(segno, entry->requests))
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\": %m",
+									FilePathName(file))));
+				/* It must have been removed, so we can safely skip it. */
+				continue;
+			}
+
+			elog(LOG, "fsync()ing %s", FilePathName(file));	/* TODO: remove me */
+			if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0)
+			{
+				FileClose(file);
+
+				/*
+				 * Keep the failed requests, but merge with any new ones.  The
+				 * requirement to be able to do this without risk of failure
+				 * prevents us from using a smaller bitmap that doesn't bother
+				 * tracking leading zeros.  Perhaps another data structure
+				 * would be better.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not fsync file \"%s\": %m",
+								FilePathName(file))));
+			}
+			requests = bms_del_member(requests, segno);
+			FileClose(file);
+
+			if (--absorb_counter <= 0)
+			{
+				AbsorbFsyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB;
+			}
+		}
+
+		bms_free(requests);
+	}
+
+	/* Flag successful completion of undofile_sync. */
+	undofile_sync_in_progress = false;
+}
+
+void undofile_postckpt(void)
+{
+}
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok)
+{
+	File		file;
+	char		path[MAXPGPATH];
+
+	UndoLogSegmentPath(relNode, segno, spcNode, path);
+	file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+	if (file <= 0 && (!missing_ok || errno != ENOENT))
+		elog(ERROR, "cannot open undo segment file '%s': %m", path);
+
+	return file;
+}
+
+/*
+ * Get a File for a particular segment of a SMgrRelation representing an undo
+ * log.
+ */
+static File undofile_get_segment_file(SMgrRelation reln, int segno)
+{
+	UndoFileState *state;
+
+
+	/*
+	 * Create private state space on demand.
+	 *
+	 * XXX There should probably be a smgr 'open' or 'init' interface that
+	 * would do this.  smgr.c currently initializes reln->md_XXX stuff
+	 * directly...
+	 */
+	state = (UndoFileState *) reln->private_data;
+	if (unlikely(state == NULL))
+	{
+		state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState));
+		reln->private_data = state;
+	}
+
+	/* If we have a file open already, check if we need to close it. */
+	if (state->mru_file > 0 && state->mru_segno != segno)
+	{
+		/* These are not the blocks we're looking for. */
+		FileClose(state->mru_file);
+		state->mru_file = 0;
+	}
+
+	/* Check if we need to open a new file. */
+	if (state->mru_file <= 0)
+	{
+		state->mru_file =
+			undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.node.spcNode,
+									   segno, InRecovery);
+		if (InRecovery && state->mru_file <= 0)
+		{
+			/*
+			 * If in recovery, we may be trying to access a file that will
+			 * later be unlinked.  Tolerate missing files, creating a new
+			 * zero-filled file as required.
+			 */
+			UndoLogNewSegment(reln->smgr_rnode.node.relNode,
+							  reln->smgr_rnode.node.spcNode,
+							  segno);
+			state->mru_file =
+				undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+										   reln->smgr_rnode.node.spcNode,
+										   segno, false);
+			Assert(state->mru_file > 0);
+		}
+		state->mru_segno = segno;
+	}
+
+	return state->mru_file;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index be2f59239bf..f02d9a39935 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -912,6 +912,13 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_READ,
+	WAIT_EVENT_UNDO_CHECKPOINT_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_SYNC,
+	WAIT_EVENT_UNDO_FILE_READ,
+	WAIT_EVENT_UNDO_FILE_WRITE,
+	WAIT_EVENT_UNDO_FILE_FLUSH,
+	WAIT_EVENT_UNDO_FILE_SYNC,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce3906a0e..5b135565527 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -38,8 +38,9 @@ typedef enum BufferAccessStrategyType
 typedef enum
 {
 	RBM_NORMAL,					/* Normal read */
-	RBM_ZERO_AND_LOCK,			/* Don't read from disk, caller will
-								 * initialize. Also locks the page. */
+	RBM_ZERO,					/* Don't read from disk, caller will
+								 * initialize. */
+	RBM_ZERO_AND_LOCK,			/* Like RBM_ZERO, but also locks the page. */
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
@@ -171,7 +172,10 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 				   BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
-						  ReadBufferMode mode, BufferAccessStrategy strategy);
+						  ReadBufferMode mode, BufferAccessStrategy strategy,
+						  char relpersistence);
+extern void ForgetBuffer(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -228,6 +232,10 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+/* in localbuf.c */
+extern void ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum,
+				  BlockNumber blockNum);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 558e4d8518b..002ae4c5e32 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -71,6 +71,9 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
+	/* For use by implementations. */
+	void	   *private_data;
+
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	struct SMgrRelationData *next_unowned_reln;
 } SMgrRelationData;
@@ -105,6 +108,7 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
+extern void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
 extern void smgrsync(void);
@@ -133,14 +137,41 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
+extern void mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
+/* in undofile.c */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+		BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+			BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber nblocks);
+extern void undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
 extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 
diff --git a/src/include/storage/undofile.h b/src/include/storage/undofile.h
new file mode 100644
index 00000000000..7544be3522b
--- /dev/null
+++ b/src/include/storage/undofile.h
@@ -0,0 +1,50 @@
+/*
+ * undofile.h
+ *
+ * PostgreSQL undo file manager.  This module manages the files that back undo
+ * logs on the filesystem.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/undofile.h
+ */
+
+#ifndef UNDOFILE_H
+#define UNDOFILE_H
+
+#include "storage/smgr.h"
+
+/* Prototypes of functions exposed to SMgr. */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber blocknum, char *buffer,
+							bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, char *buffer,
+						   bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber nblocks);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
+/* Functions used by undolog.c. */
+extern void undofile_forgetsync(Oid logno, Oid tablespace, int segno);
+
+#endif
-- 
2.17.0

0003-Add-developer-documentation-for-the-undo-log-stor-v1.patchapplication/octet-stream; name=0003-Add-developer-documentation-for-the-undo-log-stor-v1.patchDownload
From 957ef431a6a14afe1b387b47fa1e60facda537f5 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 25 May 2018 09:43:16 +1200
Subject: [PATCH 3/6] Add developer documentation for the undo log storage
 subsystem.

This document provides an overview of the design.

Author: Thomas Munro
Discussion:
---
 src/backend/access/undo/README  | 169 ++++++++++++++++++++++++++++++++
 src/backend/storage/smgr/README |  23 +++--
 2 files changed, 184 insertions(+), 8 deletions(-)
 create mode 100644 src/backend/access/undo/README

diff --git a/src/backend/access/undo/README b/src/backend/access/undo/README
new file mode 100644
index 00000000000..9ba81f960de
--- /dev/null
+++ b/src/backend/access/undo/README
@@ -0,0 +1,169 @@
+src/backend/access/undo/README
+
+Undo Logs
+=========
+
+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.  See src/backend/access/zheap/README for
+more information on zheap.  The undo log subsystem is concerned with
+raw storage optimized for efficient recycling and buffered random
+access.
+
+Like redo data (the WAL), undo data consists of records identified by
+their location within a 64 bit address space.  Unlike redo data, the
+address space is internally divided up into multiple numbered logs.
+The first 24 bits of an UndoRecPtr identify the undo log number, and
+the remaining 40 bits address the space within that undo log.  Higher
+level code (zheap) is largely oblivious to this internal structure and
+deals mostly in opaque UndoRecPtr values.
+
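+As a purely illustrative sketch (the macro names here are invented for
+this README and need not match the ones in access/undolog.h), the split
+is simple shift-and-mask arithmetic:
+
+    #define UndoRecPtrLogNumber(urp)  ((urp) >> 40)
+    #define UndoRecPtrOffset(urp)     ((urp) & ((UINT64CONST(1) << 40) - 1))
+    #define MakeUndoRecPtr(logno, offset) \
+        (((UndoRecPtr) (logno) << 40) | (offset))
+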
+Using multiple undo logs instead of a single uniform space avoids the
+contention that would result from a single insertion point, since each
+session can be given sole access to write data into a given undo log.
+It also allows for parallelized space reclamation.
+
+Like redo data, undo data is stored on disk in numbered segment files
+that are recycled as required.  Unlike redo data, undo data is
+accessed through the buffer pool.  In this respect it is similar to
+regular relation data.  Buffer content is written out to disk during
+checkpoints and whenever it is evicted to make space for another page.
+However, unlike regular relation data, undo data has a chance of never
+being written to disk at all: if a page is allocated and then
+later discarded without an intervening checkpoint and without an
+eviction provoked by memory pressure, then no disk IO is generated.
+
+Keeping the undo data physically separate from redo data and accessing
+it through the existing shared buffers mechanism allows it to be
+accessed efficiently for MVCC purposes.
+
+Meta-Data
+=========
+
+At any given time the set of undo logs that exists is tracked in
+shared memory and can be inspected in the pg_stat_undo_logs view.  For
+each undo log, a set of properties called the undo log's meta-data are
+tracked:
+
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer: data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot.  In some
+special cases including single-user mode and temporary undo logs the
+discard pointer might also be advanced synchronously by a foreground
+session.
+
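+In code terms (the field names here are only a sketch), consumers of an
+undo log's meta-data can therefore rely on assertions of the form:
+
+    Assert(log->meta.discard <= log->meta.insert);
+    Assert(log->meta.insert <= log->meta.end);
+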
+In order to provide constant time access to undo log meta-data given
+an UndoRecPtr, there is conceptually an array of UndoLogControl
+objects indexed by undo log number.  Since that array would be too
+large and since we expect the set of active undo log numbers to be
+small and clustered, we only keep small ranges of that logical array
+in memory at a time.  We use the higher order bits of the undo log
+number to identify a 'bank' (array fragment), and then the lower order
+bits to identify a slot within the bank.  Each bank is backed by a DSM
+segment.  We expect to need just 1 or 2 such DSM segments to exist at
+any time.
+
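+As a rough sketch (the constant and field names are illustrative, not
+necessarily those used by the implementation), the lookup is plain index
+arithmetic:
+
+    bank = logno >> UNDOLOG_SLOT_BITS;               /* high-order bits */
+    slot = logno & ((1 << UNDOLOG_SLOT_BITS) - 1);   /* low-order bits */
+    log  = &shared->banks[bank][slot];
+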
+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename.  At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory.  Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash.  Changes to insert pointers are included in other WAL
+records (see below).
+
+Responsibility for creating, deleting and recycling undo log segment
+files and WAL logging the associated meta-data changes lies with
+src/backend/access/undo/undolog.c.
+
+Persistence Levels and Tablespaces
+==================================
+
+When new undo log space is requested by client code, the persistence
+level of the relation being modified and the current value of the GUC
+"undo_tablespaces" controls which undo log is selected.  If the
+session is already attached to a suitable undo log and it hasn't run
+out of address space, it can be used immediately.  Otherwise a
+suitable undo log must be either found or created.  The system should
+stabilize on one undo log per active writing backend (or more if
+different tablespaces or persistence levels are used).
+
+When an unlogged relation is modified, undo data generated by the
+operation must be stored in an unlogged undo log.  This causes the
+undo data to be deleted along with all unlogged relations during
+recovery from a non-shutdown checkpoint.  Likewise, temporary
+relations require special treatment: their buffers are backend-local
+and they cannot be accessed by other backends, including undo workers.
+
+Non-empty undo logs in a tablespace prevent the tablespace from being
+dropped.
+
+Undo Log Contents
+=================
+
+Undo log contents are written into 1MB segment files under
+PGDATA/base/undo/ or PGDATA/pg_tblspc/VERSION/undo/ using filenames
+that encode the address (UndoRecPtr) of their first byte.  A period
+'.'  separates the undo log number part from the offset part, for the
+benefit of human administrators.
+
+Undo logs are page-oriented and use regular PostgreSQL page headers
+including checksums (if enabled) and LSNs.  An UndoRecPtr can be used
+to obtain a buffer and an offset within the buffer, and then regular
+buffer locking and page LSN rules apply.  While space is allocated by
+asking for a given number of usable bytes (not including page
+headers), client code is responsible for stepping over the page
+headers and advancing to the next page.
+
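+For illustration, using the sketched UndoRecPtrOffset() macro from above,
+the mapping from an UndoRecPtr to a block number and an in-page offset is
+just division and remainder (a sketch, not the exact code):
+
+    blockno = UndoRecPtrOffset(urp) / BLCKSZ;
+    offset_in_page = UndoRecPtrOffset(urp) % BLCKSZ;
+    buffer = ReadBufferWithoutRelcache(rnode, MAIN_FORKNUM, blockno,
+                                       RBM_NORMAL, NULL,
+                                       RELPERSISTENCE_PERMANENT);
+
+where rnode identifies the undo pseudo-database, the tablespace, and the
+undo log number.
+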
+Responsibility for WAL-logging the contents of the undo log lies with
+client code (i.e. zheap).  While undolog.c WAL-logs all meta-data
+changes except insert points, and checkpoints all meta-data including
+insert points, client code is responsible for allocating undo log
+space in the same sequence at recovery time.  This avoids having to
+WAL-log insertion points explicitly and separately for every insertion
+into an undo log, greatly reducing WAL traffic.  (WAL is still
+generated by undolog.c whenever a 1MB segment boundary is crossed,
+since that also advances the end pointer.)
+
+One complication of this scheme for implicit insert pointer movement
+is that recovery doesn't naturally have access to the association
+between transactions and undo logs.  That is, while 'do' sessions have
+a currently attached undo log from which they allocate new space,
+recovery is performed by a single startup process which has no concept
+of the sessions that generated the WAL it is replaying.  For that
+reason, an xid->undo log number map is maintained at recovery time.
+At 'do' time, a WAL record is emitted the first time any permanent
+undo log is used in a given transaction, so that the mapping can be
+recovered at redo time.  That allows a stream of allocations to be
+directed to the appropriate undo logs so that the same resulting
+stream of undo log pointers can be produced.  (Unlogged and temporary
+undo logs don't have this problem since they aren't used at recovery
+time.)
+
+Another complication is that the checkpoint files written under pg_undo
+may contain inconsistent data during recovery from an online checkpoint
+(after a crash or base backup).  To compensate for this, client code
+must arrange to log an undo log meta-data record when inserting the
+first WAL record that might cause undo log access during recovery.
+This is conceptually similar to full page images after checkpoints,
+but limited to one meta-data WAL record per undo log per checkpoint.
+
+
+src/backend/storage/buffer/bufmgr.c is unaware of the existence of
+undo logs as a separate category of buffered data.  Reading and writing
+of buffered undo log pages is handled by a new storage manager in
+src/backend/storage/smgr/undofile.c.  See
+src/backend/storage/smgr/README for more details.
diff --git a/src/backend/storage/smgr/README b/src/backend/storage/smgr/README
index 37ed40b6450..641926f8769 100644
--- a/src/backend/storage/smgr/README
+++ b/src/backend/storage/smgr/README
@@ -10,16 +10,14 @@ memory, but these were never supported in any externally released Postgres,
 nor in any version of PostgreSQL.)  The "magnetic disk" manager is itself
 seriously misnamed, because actually it supports any kind of device for
 which the operating system provides standard filesystem operations; which
-these days is pretty much everything of interest.  However, we retain the
-notion of a storage manager switch in case anyone ever wants to reintroduce
-other kinds of storage managers.  Removing the switch layer would save
-nothing noticeable anyway, since storage-access operations are surely far
-more expensive than one extra layer of C function calls.
+these days is pretty much everything of interest.  However, we retained the
+notion of a storage manager switch and it turned out to be useful for plugging
+in a new storage manager to support buffered undo logs.
 
 In Berkeley Postgres each relation was tagged with the ID of the storage
-manager to use for it.  This is gone.  It would be probably more reasonable
-to associate storage managers with tablespaces, should we ever re-introduce
-multiple storage managers into the system catalogs.
+manager to use for it.  This is gone.  While earlier PostgreSQL releases were
+hard-coded to use md.c unconditionally, PostgreSQL 12 routes IO for the undo
+pseudo-database to undofile.c.
 
 The files in this directory, and their contents, are
 
@@ -31,6 +29,12 @@ The files in this directory, and their contents, are
     md.c	The "magnetic disk" storage manager, which is really just
 		an interface to the kernel's filesystem operations.
 
+    undofile.c	The undo log storage manager.  This supports
+		buffer-pool based access to the contents of undo log
+		segment files.  It supports a limited subset of the
+		smgr interface: it can only read and write blocks of
+		existing files.
+
     smgrtype.c	Storage manager type -- maps string names to storage manager
 		IDs and provides simple comparison operators.  This is the
 		regproc support for type "smgr" in the system catalogs.
@@ -38,6 +42,9 @@ The files in this directory, and their contents, are
 		in the catalogs anymore.)
 
 Note that md.c in turn relies on src/backend/storage/file/fd.c.
+undofile.c also uses fd.c to read and write blocks, but it expects
+src/backend/access/undo/undolog.c to manage the files holding those
+blocks.
 
 
 Relation Forks
-- 
2.17.0

0004-Add-tests-for-the-undo-log-manager-v1.patchapplication/octet-stream; name=0004-Add-tests-for-the-undo-log-manager-v1.patchDownload
From c5ca93c3bb759079c81aac3b507a5834864506c2 Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 25 May 2018 09:43:16 +1200
Subject: [PATCH 4/6] Add tests for the undo log manager.

Provide a test module that exercises undolog.c and undofile.c.
TODO: A TAP test for recovery.

Author: Thomas Munro
Discussion:
---
 src/test/modules/Makefile                     |   1 +
 src/test/modules/test_undo/Makefile           |  28 +
 .../modules/test_undo/expected/.gitignore     |   1 +
 .../modules/test_undo/input/test_undo.source  | 107 ++++
 .../modules/test_undo/output/test_undo.source | 358 +++++++++++
 src/test/modules/test_undo/sql/.gitignore     |   1 +
 src/test/modules/test_undo/test_undo--1.0.sql |  53 ++
 src/test/modules/test_undo/test_undo.c        | 560 ++++++++++++++++++
 src/test/modules/test_undo/test_undo.control  |   4 +
 9 files changed, 1113 insertions(+)
 create mode 100644 src/test/modules/test_undo/Makefile
 create mode 100644 src/test/modules/test_undo/expected/.gitignore
 create mode 100644 src/test/modules/test_undo/input/test_undo.source
 create mode 100644 src/test/modules/test_undo/output/test_undo.source
 create mode 100644 src/test/modules/test_undo/sql/.gitignore
 create mode 100644 src/test/modules/test_undo/test_undo--1.0.sql
 create mode 100644 src/test/modules/test_undo/test_undo.c
 create mode 100644 src/test/modules/test_undo/test_undo.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 19d60a506e1..43323a6f2ad 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -18,6 +18,7 @@ SUBDIRS = \
 		  test_rbtree \
 		  test_rls_hooks \
 		  test_shm_mq \
+		  test_undo \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/test_undo/Makefile b/src/test/modules/test_undo/Makefile
new file mode 100644
index 00000000000..ce41746d0b8
--- /dev/null
+++ b/src/test/modules/test_undo/Makefile
@@ -0,0 +1,28 @@
+# src/test/modules/test_undo/Makefile
+
+MODULE_big = test_undo
+OBJS = test_undo.o
+PGFILEDESC = "test_undo - a test module for the undo log manager"
+
+EXTENSION = test_undo
+DATA = test_undo--1.0.sql
+
+REGRESS = test_undo
+
+check: tablespace-setup
+
+.PHONY: tablespace-setup
+tablespace-setup:
+	rm -fr testtablespace1 testtablespace2
+	mkdir testtablespace1 testtablespace2
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_undo
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_undo/expected/.gitignore b/src/test/modules/test_undo/expected/.gitignore
new file mode 100644
index 00000000000..1bb8bf6d7fd
--- /dev/null
+++ b/src/test/modules/test_undo/expected/.gitignore
@@ -0,0 +1 @@
+# empty
diff --git a/src/test/modules/test_undo/input/test_undo.source b/src/test/modules/test_undo/input/test_undo.source
new file mode 100644
index 00000000000..21d1ee6e2bc
--- /dev/null
+++ b/src/test/modules/test_undo/input/test_undo.source
@@ -0,0 +1,107 @@
+create extension test_undo;
+
+create view undo_logs as
+  select log_number,
+         persistence,
+         tablespace,
+         discard,
+         insert,
+         "end",
+         pid = pg_backend_pid() as my_pid,
+         xid = txid_current()::text::xid as my_xid
+    from pg_stat_undo_logs;
+
+begin;
+
+-- permanent storage
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+select * from undo_logs order by log_number;
+-- write a short message
+select undo_append('[permanent]'::bytea, 'permanent');
+select * from undo_logs order by log_number;
+-- see if we can read it back
+select undo_dump('000000000000003C', 11, 'permanent');
+
+-- unlogged storage
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'unlogged');
+select * from undo_logs order by log_number;
+-- write a short message
+select undo_append('<unlogged> '::bytea, 'unlogged');
+select * from undo_logs order by log_number;
+-- see if we can read it back
+select undo_dump('000001000000003C', 11, 'unlogged');
+
+-- temporary storage
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'temporary');
+select * from undo_logs order by log_number;
+-- write a short message
+select undo_append('{temporary}'::bytea, 'temporary');
+select * from undo_logs order by log_number;
+-- see if we can read it back
+select undo_dump('000002000000003C', 11, 'temporary');
+
+-- discard the data we wrote in each of those logs
+select undo_discard('0000000000000047');
+select * from undo_logs order by log_number;
+select undo_discard('0000010000000047');
+select * from undo_logs order by log_number;
+select undo_discard('0000020000000047');
+select * from undo_logs order by log_number;
+
+commit;
+
+create tablespace ts1 location '@testtablespace@1';
+create tablespace ts2 location '@testtablespace@2';
+
+begin;
+set undo_tablespaces = ts1;
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+select * from undo_logs order by log_number;
+-- write a short message
+select undo_append('ts1:perm---'::bytea, 'permanent');
+select * from undo_logs order by log_number;
+-- discard
+select undo_discard('0000030000000047');
+select * from undo_logs order by log_number;
+commit;
+
+begin;
+set undo_tablespaces = ts2;
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+select * from undo_logs order by log_number;
+-- write a short message
+select undo_append('ts2:perm---', 'permanent');
+select * from undo_logs order by log_number;
+-- discard
+select undo_discard('0000040000000047');
+select * from undo_logs order by log_number;
+commit;
+
+-- check that we can drop tablespaces (because there is nothing in them)
+drop tablespace ts1;
+drop tablespace ts2;
+
+-- we fail to allocate space now that ts2 is gone
+begin;
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+select * from undo_logs order by log_number;
+commit;
+
+-- we go back to allocating from log 0 if we clear the GUC
+begin;
+set undo_tablespaces = '';
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+select * from undo_logs order by log_number;
+-- discard
+select undo_discard('000000000000006B');
+select * from undo_logs order by log_number;
+commit;
+
+drop view undo_logs;
diff --git a/src/test/modules/test_undo/output/test_undo.source b/src/test/modules/test_undo/output/test_undo.source
new file mode 100644
index 00000000000..13ec679ae94
--- /dev/null
+++ b/src/test/modules/test_undo/output/test_undo.source
@@ -0,0 +1,358 @@
+create extension test_undo;
+create view undo_logs as
+  select log_number,
+         persistence,
+         tablespace,
+         discard,
+         insert,
+         "end",
+         pid = pg_backend_pid() as my_pid,
+         xid = txid_current()::text::xid as my_xid
+    from pg_stat_undo_logs;
+begin;
+-- permanent storage
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+NOTICE:  will copy 20 bytes into undo log at 0000000000000018
+NOTICE:  writing chunk at offset 24
+NOTICE:  will copy 16 bytes into undo log at 000000000000002C
+NOTICE:  writing chunk at offset 44
+ undo_append_transaction_header 
+--------------------------------
+ 0000000000000018
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000018 | 000000000000003C | 0000000000100000 | t      | t
+(1 row)
+
+-- write a short message
+select undo_append('[permanent]'::bytea, 'permanent');
+NOTICE:  will copy 11 bytes into undo log at 000000000000003C
+NOTICE:  writing chunk at offset 60
+   undo_append    
+------------------
+ 000000000000003C
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000018 | 0000000000000047 | 0000000000100000 | t      | t
+(1 row)
+
+-- see if we can read it back
+select undo_dump('000000000000003C', 11, 'permanent');
+NOTICE:  0000000000000038: 00 00 00 00 5b 70 65 72 ....[per
+NOTICE:  0000000000000040: 6d 61 6e 65 6e 74 5d 00 manent].
+ undo_dump 
+-----------
+ 
+(1 row)
+
+-- unlogged storage
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'unlogged');
+NOTICE:  will copy 20 bytes into undo log at 0000010000000018
+NOTICE:  writing chunk at offset 24
+NOTICE:  will copy 16 bytes into undo log at 000001000000002C
+NOTICE:  writing chunk at offset 44
+ undo_append_transaction_header 
+--------------------------------
+ 0000010000000018
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000018 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000018 | 000001000000003C | 0000010000100000 | t      | t
+(2 rows)
+
+-- write a short message
+select undo_append('<unlogged> '::bytea, 'unlogged');
+NOTICE:  will copy 11 bytes into undo log at 000001000000003C
+NOTICE:  writing chunk at offset 60
+   undo_append    
+------------------
+ 000001000000003C
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000018 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000018 | 0000010000000047 | 0000010000100000 | t      | t
+(2 rows)
+
+-- see if we can read it back
+select undo_dump('000001000000003C', 11, 'unlogged');
+NOTICE:  0000010000000038: 00 00 00 00 3c 75 6e 6c ....<unl
+NOTICE:  0000010000000040: 6f 67 67 65 64 3e 20 00 ogged> .
+ undo_dump 
+-----------
+ 
+(1 row)
+
+-- temporary storage
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'temporary');
+NOTICE:  will copy 20 bytes into undo log at 0000020000000018
+NOTICE:  writing chunk at offset 24
+NOTICE:  will copy 16 bytes into undo log at 000002000000002C
+NOTICE:  writing chunk at offset 44
+ undo_append_transaction_header 
+--------------------------------
+ 0000020000000018
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000018 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000018 | 0000010000000047 | 0000010000100000 | t      | t
+          2 | temporary   | pg_default | 0000020000000018 | 000002000000003C | 0000020000100000 | t      | t
+(3 rows)
+
+-- write a short message
+select undo_append('{temporary}'::bytea, 'temporary');
+NOTICE:  will copy 11 bytes into undo log at 000002000000003C
+NOTICE:  writing chunk at offset 60
+   undo_append    
+------------------
+ 000002000000003C
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000018 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000018 | 0000010000000047 | 0000010000100000 | t      | t
+          2 | temporary   | pg_default | 0000020000000018 | 0000020000000047 | 0000020000100000 | t      | t
+(3 rows)
+
+-- see if we can read it back
+select undo_dump('000002000000003C', 11, 'temporary');
+NOTICE:  0000020000000038: 00 00 00 00 7b 74 65 6d ....{tem
+NOTICE:  0000020000000040: 70 6f 72 61 72 79 7d 00 porary}.
+ undo_dump 
+-----------
+ 
+(1 row)
+
+-- discard the data we wrote in each of those logs
+select undo_discard('0000000000000047');
+ undo_discard 
+--------------
+ 
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000018 | 0000010000000047 | 0000010000100000 | t      | t
+          2 | temporary   | pg_default | 0000020000000018 | 0000020000000047 | 0000020000100000 | t      | t
+(3 rows)
+
+select undo_discard('0000010000000047');
+ undo_discard 
+--------------
+ 
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 | t      | t
+          2 | temporary   | pg_default | 0000020000000018 | 0000020000000047 | 0000020000100000 | t      | t
+(3 rows)
+
+select undo_discard('0000020000000047');
+ undo_discard 
+--------------
+ 
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 | t      | t
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 | t      | t
+(3 rows)
+
+commit;
+create tablespace ts1 location '@testtablespace@1';
+create tablespace ts2 location '@testtablespace@2';
+begin;
+set undo_tablespaces = ts1;
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+NOTICE:  will copy 20 bytes into undo log at 0000030000000018
+NOTICE:  writing chunk at offset 24
+NOTICE:  will copy 16 bytes into undo log at 000003000000002C
+NOTICE:  writing chunk at offset 44
+ undo_append_transaction_header 
+--------------------------------
+ 0000030000000018
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 |        | 
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+          3 | permanent   | ts1        | 0000030000000018 | 000003000000003C | 0000030000100000 | t      | t
+(4 rows)
+
+-- write a short message
+select undo_append('ts1:perm---'::bytea, 'permanent');
+NOTICE:  will copy 11 bytes into undo log at 000003000000003C
+NOTICE:  writing chunk at offset 60
+   undo_append    
+------------------
+ 000003000000003C
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 |        | 
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+          3 | permanent   | ts1        | 0000030000000018 | 0000030000000047 | 0000030000100000 | t      | t
+(4 rows)
+
+-- discard
+select undo_discard('0000030000000047');
+ undo_discard 
+--------------
+ 
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 |        | 
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+          3 | permanent   | ts1        | 0000030000000047 | 0000030000000047 | 0000030000100000 | t      | t
+(4 rows)
+
+commit;
+begin;
+set undo_tablespaces = ts2;
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+NOTICE:  will copy 20 bytes into undo log at 0000040000000018
+NOTICE:  writing chunk at offset 24
+NOTICE:  will copy 16 bytes into undo log at 000004000000002C
+NOTICE:  writing chunk at offset 44
+ undo_append_transaction_header 
+--------------------------------
+ 0000040000000018
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 |        | 
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+          3 | permanent   | ts1        | 0000030000000047 | 0000030000000047 | 0000030000100000 |        | 
+          4 | permanent   | ts2        | 0000040000000018 | 000004000000003C | 0000040000100000 | t      | t
+(5 rows)
+
+-- write a short message
+select undo_append('ts2:perm---', 'permanent');
+NOTICE:  will copy 11 bytes into undo log at 000004000000003C
+NOTICE:  writing chunk at offset 60
+   undo_append    
+------------------
+ 000004000000003C
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 |        | 
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+          3 | permanent   | ts1        | 0000030000000047 | 0000030000000047 | 0000030000100000 |        | 
+          4 | permanent   | ts2        | 0000040000000018 | 0000040000000047 | 0000040000100000 | t      | t
+(5 rows)
+
+-- discard
+select undo_discard('0000040000000047');
+ undo_discard 
+--------------
+ 
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 0000000000000047 | 0000000000100000 |        | 
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+          3 | permanent   | ts1        | 0000030000000047 | 0000030000000047 | 0000030000100000 |        | 
+          4 | permanent   | ts2        | 0000040000000047 | 0000040000000047 | 0000040000100000 | t      | t
+(5 rows)
+
+commit;
+-- check that we can drop tablespaces (because there is nothing in them)
+drop tablespace ts1;
+drop tablespace ts2;
+-- we fail to allocate space now that ts2 is gone
+begin;
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+ERROR:  tablespace "ts2" does not exist
+HINT:  Create the tablespace or set undo_tablespaces to a valid or empty list.
+select * from undo_logs order by log_number;
+ERROR:  current transaction is aborted, commands ignored until end of transaction block
+commit;
+-- we go back to allocating from log 0 if we clear the GUC
+begin;
+set undo_tablespaces = '';
+-- write a transaction header to avoid upsetting undo workers
+select undo_append_transaction_header(txid_current()::text::xid, 'permanent');
+NOTICE:  will copy 20 bytes into undo log at 0000000000000047
+NOTICE:  writing chunk at offset 71
+NOTICE:  will copy 16 bytes into undo log at 000000000000005B
+NOTICE:  writing chunk at offset 91
+ undo_append_transaction_header 
+--------------------------------
+ 0000000000000047
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 0000000000000047 | 000000000000006B | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+(3 rows)
+
+-- discard
+select undo_discard('000000000000006B');
+ undo_discard 
+--------------
+ 
+(1 row)
+
+select * from undo_logs order by log_number;
+ log_number | persistence | tablespace |     discard      |      insert      |       end        | my_pid | my_xid 
+------------+-------------+------------+------------------+------------------+------------------+--------+--------
+          0 | permanent   | pg_default | 000000000000006B | 000000000000006B | 0000000000100000 | t      | t
+          1 | unlogged    | pg_default | 0000010000000047 | 0000010000000047 | 0000010000100000 |        | 
+          2 | temporary   | pg_default | 0000020000000047 | 0000020000000047 | 0000020000100000 |        | 
+(3 rows)
+
+commit;
+drop view undo_logs;
diff --git a/src/test/modules/test_undo/sql/.gitignore b/src/test/modules/test_undo/sql/.gitignore
new file mode 100644
index 00000000000..1bb8bf6d7fd
--- /dev/null
+++ b/src/test/modules/test_undo/sql/.gitignore
@@ -0,0 +1 @@
+# empty
diff --git a/src/test/modules/test_undo/test_undo--1.0.sql b/src/test/modules/test_undo/test_undo--1.0.sql
new file mode 100644
index 00000000000..4ab4813cf44
--- /dev/null
+++ b/src/test/modules/test_undo/test_undo--1.0.sql
@@ -0,0 +1,53 @@
+\echo Use "CREATE EXTENSION test_undo" to load this file. \quit
+
+CREATE FUNCTION undo_allocate(size int, persistence text)
+RETURNS text
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_advance(ptr text, size int, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_append(bytes bytea, persistence text)
+RETURNS text
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_append_transaction_header(xid xid, persistence text)
+RETURNS text
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_append_file(path text)
+RETURNS text
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_extract_file(path text, undo_ptr text, size int, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_dump(undo_ptr text, size int, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_discard(undo_ptr text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_is_discarded(undo_ptr text)
+RETURNS boolean
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+CREATE FUNCTION undo_foreground_discard_test(loops int, size int, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_undo/test_undo.c b/src/test/modules/test_undo/test_undo.c
new file mode 100644
index 00000000000..2296ba08654
--- /dev/null
+++ b/src/test/modules/test_undo/test_undo.c
@@ -0,0 +1,560 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "catalog/pg_class.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(undo_allocate);
+PG_FUNCTION_INFO_V1(undo_advance);
+PG_FUNCTION_INFO_V1(undo_append);
+PG_FUNCTION_INFO_V1(undo_append_file);
+PG_FUNCTION_INFO_V1(undo_append_transaction_header);
+PG_FUNCTION_INFO_V1(undo_extract_file);
+PG_FUNCTION_INFO_V1(undo_dump);
+PG_FUNCTION_INFO_V1(undo_discard);
+PG_FUNCTION_INFO_V1(undo_is_discarded);
+PG_FUNCTION_INFO_V1(undo_foreground_discard_test);
+
+/*
+ * It's nice to show UndoRecPtr always as hex, because that way you can see
+ * the components easily.  Bigint just doesn't really work because it's
+ * signed.
+ */
+static text *
+undo_rec_ptr_to_text(UndoRecPtr undo_ptr)
+{
+	char buffer[17];
+
+	snprintf(buffer, sizeof(buffer), UndoRecPtrFormat, undo_ptr);
+	return cstring_to_text(buffer);
+}
+
+static UndoRecPtr
+undo_rec_ptr_from_text(text *t)
+{
+	UndoRecPtr undo_ptr;
+
+	if (sscanf(text_to_cstring(t), "%zx", &undo_ptr) != 1)
+		elog(ERROR, "could not parse UndoRecPtr (expected hex)");
+	return undo_ptr;
+}
+
+static UndoPersistence
+undo_persistence_from_text(text *t)
+{
+	char *str = text_to_cstring(t);
+
+	if (strcmp(str, "permanent") == 0)
+		return UNDO_PERMANENT;
+	else if (strcmp(str, "temporary") == 0)
+		return UNDO_TEMP;
+	else if (strcmp(str, "unlogged") == 0)
+		return UNDO_UNLOGGED;
+	else
+		elog(ERROR, "unknown undo persistence level: %s", str);
+}
+
+/*
+ * Just allocate some undo space, for testing.  This may cause us to be
+ * attached to an undo log, possibly creating it on demand.
+ */
+Datum
+undo_allocate(PG_FUNCTION_ARGS)
+{
+	int size = PG_GETARG_INT32(0);
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	UndoRecPtr undo_ptr;
+
+	undo_ptr = UndoLogAllocate(size, persistence, NULL);
+
+	PG_RETURN_TEXT_P(undo_rec_ptr_to_text(undo_ptr));
+}
+
+/*
+ * Advance the insert pointer for an undo log, for testing.  The undo_ptr
+ * value given must have been returned by undo_allocate(), and the size
+ * given must be the argument that was given to undo_allocate().  The call
+ * to undo_allocate() reserved space for us and told us where it is, and now
+ * we are advancing the insertion pointer (presumably having written data
+ * there).
+ */
+Datum
+undo_advance(PG_FUNCTION_ARGS)
+{
+	UndoRecPtr undo_ptr = undo_rec_ptr_from_text(PG_GETARG_TEXT_PP(0));
+	int size = PG_GETARG_INT32(1);
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(2));
+
+	UndoLogAdvance(undo_ptr, size, persistence);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Advance the discard pointer in an undo log.
+ */
+Datum
+undo_discard(PG_FUNCTION_ARGS)
+{
+	UndoRecPtr undo_ptr = undo_rec_ptr_from_text(PG_GETARG_TEXT_PP(0));
+
+	UndoLogDiscard(undo_ptr, InvalidTransactionId);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Allocate space and write the contents of a file into it.
+ */
+Datum
+undo_append_file(PG_FUNCTION_ARGS)
+{
+	char *path = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	size_t size;
+	size_t remaining;
+	UndoRecPtr start_undo_ptr;
+	UndoRecPtr insert_undo_ptr;
+	int fd;
+
+	/* Open the file and check its size. */
+	fd = open(path, O_RDONLY, 0);
+	if (fd < 0)
+		elog(ERROR, "could not open file '%s': %m", path);
+	size = lseek(fd, 0, SEEK_END);
+	lseek(fd, 0, SEEK_SET);
+
+	/* Allocate undo log space. */
+	start_undo_ptr = UndoLogAllocate(size, persistence, NULL);
+
+	elog(NOTICE, "will copy %zu bytes into undo log", size);
+
+	/* Copy data into shared buffers. */
+	insert_undo_ptr = start_undo_ptr;
+	remaining = size;
+	while (remaining > 0)
+	{
+		RelFileNode rfn;
+		Buffer buffer;
+		char *page;
+		size_t this_chunk_offset;
+		size_t this_chunk_size;
+		char data[BLCKSZ];
+		ssize_t bytes_read;
+
+		/*
+		 * Figure out how much we can fit on the page that insert_undo_ptr
+		 * points to.
+		 */
+		this_chunk_offset = UndoRecPtrGetPageOffset(insert_undo_ptr);
+		this_chunk_size = Min(remaining, BLCKSZ - this_chunk_offset);
+
+		Assert(this_chunk_offset >= UndoLogBlockHeaderSize);
+		Assert(this_chunk_size <= UndoLogUsableBytesPerPage);
+		Assert(this_chunk_offset + this_chunk_size <= BLCKSZ);
+
+		bytes_read = read(fd, data, this_chunk_size);
+		if (bytes_read < 0)
+		{
+			int save_errno = errno;
+			close(fd);
+			errno = save_errno;
+			elog(ERROR, "failed to read from '%s': %m", path);
+		}
+		if (bytes_read < this_chunk_size)
+		{
+			/*
+			 * This is a bit silly, we should be prepared to handle this but
+			 * for this demo code we'll just give up.
+			 */
+			close(fd);
+			elog(ERROR, "short read from '%s'", path);
+		}
+
+		/* Copy the chunk onto the page. */
+		UndoRecPtrAssignRelFileNode(rfn, insert_undo_ptr);
+		buffer =
+			ReadBufferWithoutRelcache(rfn,
+									  UndoLogForkNum,
+									  UndoRecPtrGetBlockNum(insert_undo_ptr),
+									  RBM_NORMAL,
+									  NULL,
+									  persistence);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		if (this_chunk_offset == UndoLogBlockHeaderSize)
+			PageInit(page, BLCKSZ, 0);
+		memcpy(page + this_chunk_offset, data, this_chunk_size);
+		MarkBufferDirty(buffer);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
+
+		/* Prepare to put the next chunk on the next page. */
+		insert_undo_ptr += this_chunk_size;
+		remaining -= this_chunk_size;
+
+		/* Step over the page header if we landed at the start of page. */
+		if (UndoRecPtrGetPageOffset(insert_undo_ptr) == 0)
+			insert_undo_ptr += UndoLogBlockHeaderSize;
+	}
+
+	/* Advance the undo log insert point.  No need to consider headers. */
+	UndoLogAdvance(start_undo_ptr, size, persistence);
+
+	/*
+	 * We'd leak a file descriptor if code above raised an error, but not
+	 * worrying about that for this demo code.
+	 */
+	close(fd);
+
+	PG_RETURN_TEXT_P(undo_rec_ptr_to_text(start_undo_ptr));
+}
+
+/*
+ * Extract the contents of an undo log into a file.
+ */
+Datum
+undo_extract_file(PG_FUNCTION_ARGS)
+{
+	char *path = text_to_cstring(PG_GETARG_TEXT_PP(0));
+	UndoRecPtr undo_ptr = undo_rec_ptr_from_text(PG_GETARG_TEXT_PP(1));
+	size_t size = (size_t) PG_GETARG_INT32(2);
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(3));
+	size_t remaining = size;
+	int fd;
+
+	if (UndoRecPtrGetPageOffset(undo_ptr) < UndoLogBlockHeaderSize)
+		elog(ERROR, "undo pointer points to header data");
+
+	fd = open(path, O_WRONLY | O_CREAT, 0664);
+	if (fd < 0)
+		elog(ERROR, "can't open '%s': %m", path);
+
+	while (remaining > 0)
+	{
+		RelFileNode rfn;
+		Buffer buffer;
+		char *page;
+		size_t this_chunk_offset;
+		size_t this_chunk_size;
+		char data[BLCKSZ];
+		ssize_t bytes_written;
+
+		/*
+		 * Figure out how much we can read from the page that undo_ptr points
+		 * to.
+		 */
+		this_chunk_offset = UndoRecPtrGetPageOffset(undo_ptr);
+		this_chunk_size = Min(remaining, BLCKSZ - this_chunk_offset);
+
+		/* Copy region of page contents to buffer. */
+		UndoRecPtrAssignRelFileNode(rfn, undo_ptr);
+		buffer =
+			ReadBufferWithoutRelcache(rfn,
+									  UndoLogForkNum,
+									  UndoRecPtrGetBlockNum(undo_ptr),
+									  RBM_NORMAL,
+									  NULL,
+									  persistence);
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+		memcpy(data, page + this_chunk_offset, this_chunk_size);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
+
+		/* Write out. */
+		bytes_written = write(fd, data, this_chunk_size);
+		if (bytes_written < 0)
+		{
+			int save_errno = errno;
+			close(fd);
+			errno = save_errno;
+			elog(ERROR, "failed to write to '%s': %m", path);
+		}
+		if (bytes_written < this_chunk_size)
+		{
+			/*
+			 * This is a bit silly, we should be prepared to handle this but
+			 * for this demo code we'll just give up.
+			 */
+			close(fd);
+			elog(ERROR, "short write to '%s'", path);
+		}
+
+		/* Prepare to put the next chunk on the next page. */
+		undo_ptr += this_chunk_size;
+		remaining -= this_chunk_size;
+
+		/* Step over the page header if we landed at the start of page. */
+		if (UndoRecPtrGetPageOffset(undo_ptr) == 0)
+			undo_ptr += UndoLogBlockHeaderSize;
+	}
+	PG_RETURN_VOID();
+}
+
+/*
+ * Allocate space and write data into it.
+ */
+static UndoRecPtr
+undo_append_raw(void *data, size_t size, UndoPersistence persistence)
+{
+	size_t remaining;
+	UndoRecPtr start_undo_ptr;
+	UndoRecPtr insert_undo_ptr;
+
+	/* Allocate undo log space for our data. */
+	start_undo_ptr = UndoLogAllocate(size, persistence, NULL);
+
+	elog(NOTICE, "will copy %zu bytes into undo log at " UndoRecPtrFormat,
+		 size, start_undo_ptr);
+
+	/*
+	 * Copy data into shared buffers.  Real code that does this would need to
+	 * WAL-log something that would redo this.
+	 */
+	insert_undo_ptr = start_undo_ptr;
+	remaining = size;
+	while (remaining > 0)
+	{
+		RelFileNode rfn;
+		Buffer buffer;
+		char *page;
+		size_t this_chunk_offset;
+		size_t this_chunk_size;
+
+		/*
+		 * Figure out how much we can fit on the page that insert_undo_ptr
+		 * points to.
+		 */
+		this_chunk_offset = UndoRecPtrGetPageOffset(insert_undo_ptr);
+		this_chunk_size = Min(remaining, BLCKSZ - this_chunk_offset);
+
+		Assert(this_chunk_offset >= UndoLogBlockHeaderSize);
+		Assert(this_chunk_size <= UndoLogUsableBytesPerPage);
+		Assert(this_chunk_offset + this_chunk_size <= BLCKSZ);
+		elog(NOTICE, "writing chunk at offset %zu", this_chunk_offset);
+
+		/* Copy the chunk onto the page. */
+		UndoRecPtrAssignRelFileNode(rfn, insert_undo_ptr);
+		buffer =
+			ReadBufferWithoutRelcache(rfn,
+									  UndoLogForkNum,
+									  UndoRecPtrGetBlockNum(insert_undo_ptr),
+									  RBM_NORMAL,
+									  NULL,
+									  persistence);
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+		page = BufferGetPage(buffer);
+		if (this_chunk_offset == UndoLogBlockHeaderSize)
+			PageInit(page, BLCKSZ, 0);
+		memcpy(page + this_chunk_offset, data, this_chunk_size);
+		MarkBufferDirty(buffer);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
+
+		/* Prepare to put the next chunk on the next page. */
+		insert_undo_ptr += this_chunk_size;
+		data = (char *) data + this_chunk_size;
+		remaining -= this_chunk_size;
+
+		/* Step over the page header if we landed at the start of page. */
+		if (UndoRecPtrGetPageOffset(insert_undo_ptr) == 0)
+			insert_undo_ptr += UndoLogBlockHeaderSize;
+	}
+
+	/* Advance the undo log insert point.  No need to consider headers. */
+	UndoLogAdvance(start_undo_ptr, size, persistence);
+
+	return start_undo_ptr;
+}
+
+/*
+ * Allocate space and write data into it.
+ */
+Datum
+undo_append(PG_FUNCTION_ARGS)
+{
+	bytea *input = PG_GETARG_BYTEA_PP(0);
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	void *data = VARDATA_ANY(input);
+	size_t size = VARSIZE_ANY_EXHDR(input);
+
+	PG_RETURN_TEXT_P(undo_rec_ptr_to_text(undo_append_raw(data, size, persistence)));
+}
+
+
+/*
+ * We need to be able to write a transaction header that will prevent the undo
+ * background worker from discarding any data that follows it until the
+ * referenced xid has committed.  We define this here to avoid problematic
+ * interactions with later patches that add record level abstractions, but it
+ * might be removed later.
+ */
+typedef struct TestRecordHeader
+{
+	uint8		urec_type;
+	uint8		urec_info;
+	uint16		urec_prevlen;
+	Oid			urec_relfilenode;
+	TransactionId urec_prevxid;
+	TransactionId urec_xid;
+	CommandId	urec_cid;
+} TestRecordHeader;
+
+typedef struct TestRecordTransaction
+{
+	uint32			urec_xidepoch;
+	uint64			urec_next;
+} TestRecordTransaction;
+
+Datum
+undo_append_transaction_header(PG_FUNCTION_ARGS)
+{
+	TestRecordHeader header1;
+	TestRecordTransaction header2;
+	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	UndoRecPtr	result;
+
+	memset(&header1, 0, sizeof(header1));
+	header1.urec_type = 0x08;
+	header1.urec_xid = xid;
+	memset(&header2, 0, sizeof(header2));
+	header2.urec_next = InvalidUndoRecPtr;
+
+	result =
+		undo_append_raw(&header1,
+						offsetof(TestRecordHeader, urec_cid) +
+						sizeof(CommandId),
+						persistence);
+	undo_append_raw(&header2, sizeof(header2), persistence);
+
+	PG_RETURN_TEXT_P(undo_rec_ptr_to_text(result));
+}
+
+Datum
+undo_dump(PG_FUNCTION_ARGS)
+{
+	UndoRecPtr undo_ptr = undo_rec_ptr_from_text(PG_GETARG_TEXT_PP(0));
+	size_t size = (size_t) PG_GETARG_INT32(1);
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(2));
+	size_t remaining;
+
+
+	/* Rewind so that we start on an 8-byte block. */
+	if (undo_ptr % 8 != 0)
+	{
+		int extra_prefix = 8 - undo_ptr % 8;
+
+		undo_ptr -= extra_prefix;
+		size += extra_prefix;
+	}
+	/* Extend size so we show an 8-byte block. */
+	if (size % 8 != 0)
+		size += 8 - size % 8;
+	remaining = size;
+
+	while (remaining > 0)
+	{
+		RelFileNode rfn;
+		Buffer buffer;
+		char *page;
+		size_t this_chunk_offset;
+		size_t this_chunk_size;
+		unsigned char data[8];
+		char line[80];
+		int i;
+
+		/*
+		 * Figure out how much we can read from the page that undo_ptr points
+		 * to.
+		 */
+		this_chunk_offset = UndoRecPtrGetPageOffset(undo_ptr);
+		this_chunk_size = 8;
+
+		/* Copy region of page contents to buffer. */
+		UndoRecPtrAssignRelFileNode(rfn, undo_ptr);
+		buffer =
+			ReadBufferWithoutRelcache(rfn,
+									  UndoLogForkNum,
+									  UndoRecPtrGetBlockNum(undo_ptr),
+									  RBM_NORMAL,
+									  NULL,
+									  persistence);
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+		memcpy(data, page + this_chunk_offset, this_chunk_size);
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+		ReleaseBuffer(buffer);
+
+		/* Write out.  Apologies for this horrible code. */
+		snprintf(line, sizeof(line), UndoRecPtrFormat ": ", undo_ptr);
+		for (i = 0; i < 8; ++i)
+			snprintf(&line[18 + 3 * i], 4, "%02x ", data[i]);
+		for (i = 0; i < 8; ++i)
+		{
+			char c = '.';
+
+			if (data[i] >= ' ' && data[i] <= 127)
+				c = data[i];
+			line[18 + 3 * 8 + i] = c;
+		}
+		line[18 + 3 * i + i] = '\0';
+		elog(NOTICE, "%s", line);
+
+		/* Prepare to put the next chunk on the next page. */
+		undo_ptr += this_chunk_size;
+		remaining -= this_chunk_size;
+
+		/* Step over the page header if we landed at the start of page. */
+		if (UndoRecPtrGetPageOffset(undo_ptr) == 0)
+			undo_ptr += UndoLogBlockHeaderSize;
+	}
+	PG_RETURN_VOID();
+}
+
+Datum
+undo_foreground_discard_test(PG_FUNCTION_ARGS)
+{
+	int loops = PG_GETARG_INT32(0);
+	int size = PG_GETARG_INT32(1);
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(2));
+	int i;
+
+	if (size > BLCKSZ)
+		elog(ERROR, "data too large");
+
+	for (i = 0; i < loops; ++i)
+	{
+		UndoRecPtr undo_ptr;
+
+		/* Allocate some space. */
+		undo_ptr = UndoLogAllocate(size, persistence, NULL);
+		UndoLogAdvance(undo_ptr, size, persistence);
+
+		/* Discard the space that we just allocated. */
+		UndoLogDiscard(undo_ptr + size, InvalidTransactionId);
+	}
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Check if an undo pointer has been discarded.
+ */
+Datum
+undo_is_discarded(PG_FUNCTION_ARGS)
+{
+	PG_RETURN_BOOL(UndoLogIsDiscarded(undo_rec_ptr_from_text(PG_GETARG_TEXT_PP(0))));
+}
diff --git a/src/test/modules/test_undo/test_undo.control b/src/test/modules/test_undo/test_undo.control
new file mode 100644
index 00000000000..4595f52477b
--- /dev/null
+++ b/src/test/modules/test_undo/test_undo.control
@@ -0,0 +1,4 @@
+comment = 'test_undo'
+default_version = '1.0'
+module_pathname = '$libdir/test_undo'
+relocatable = true
-- 
2.17.0

0005-Add-user-facing-documentation-for-undo-logs-v1.patchapplication/octet-stream; name=0005-Add-user-facing-documentation-for-undo-logs-v1.patchDownload
From fa025401a5bc487384b628099ca2c03e4962459d Mon Sep 17 00:00:00 2001
From: Thomas Munro <thomas.munro@enterprisedb.com>
Date: Fri, 25 May 2018 09:43:17 +1200
Subject: [PATCH 5/6] Add user-facing documentation for undo logs.

Document the pg_stat_undo_logs view, the wait events, and the storage layout on
disk for undo logs.

Author: Thomas Munro
---
 doc/src/sgml/config.sgml     |  35 ++++++++++++
 doc/src/sgml/monitoring.sgml | 107 ++++++++++++++++++++++++++++++++++-
 doc/src/sgml/storage.sgml    |  56 ++++++++++++++++++
 3 files changed, 197 insertions(+), 1 deletion(-)

diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index b60240ecfe7..3c6886c7f98 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -6687,6 +6687,41 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
       </listitem>
      </varlistentry>
 
+     <varlistentry id="guc-undo-tablespaces" xreflabel="undo_tablespaces">
+      <term><varname>undo_tablespaces</varname> (<type>string</type>)
+      <indexterm>
+       <primary><varname>undo_tablespaces</varname> configuration parameter</primary>
+      </indexterm>
+      <indexterm><primary>tablespace</primary><secondary>undo</secondary></indexterm>
+      </term>
+      <listitem>
+       <para>
+        This variable specifies tablespaces in which to store undo data, when
+        undo-aware storage managers (initially "zheap") perform writes.
+       </para>
+
+       <para>
+        The value is a list of names of tablespaces.  When there is more than
+        one name in the list, <productname>PostgreSQL</productname> chooses an
+        arbitrary one.  If the name doesn't correspond to an existing
+        tablespace, the next name is tried, and so on until all names have
+        been tried.  If no valid tablespace is specified, an error is raised.
+        The validation of the name doesn't happen until the first attempt to
+        write undo data.
+       </para>
+
+       <para>
+        The variable can only be changed before the first statement is
+        executed in a transaction.
+       </para>
+
+       <para>
+        The default value is an empty string, which results in undo data
+        being stored in the default tablespace.
+       </para>
+      </listitem>
+     </varlistentry>
+ 
      <varlistentry id="guc-check-function-bodies" xreflabel="check_function_bodies">
       <term><varname>check_function_bodies</varname> (<type>boolean</type>)
       <indexterm>
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index c278076e68d..8039b57c253 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -332,6 +332,14 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
       </entry>
      </row>
 
+     <row>
+      <entry><structname>pg_stat_undo_logs</structname><indexterm><primary>pg_stat_undo_logs</primary></indexterm></entry>
+      <entry>One row for each undo log, showing current pointers,
+       transactions and backends.
+       See <xref linkend="pg-stat-undo-logs-view"/> for details.
+      </entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
@@ -549,7 +557,6 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
    into the kernel's handling of I/O.
   </para>
 
-
   <table id="pg-stat-activity-view" xreflabel="pg_stat_activity">
    <title><structname>pg_stat_activity</structname> View</title>
 
@@ -1638,6 +1645,30 @@ postgres   27093  0.0  0.0  30096  2752 ?        Ss   11:34   0:00 postgres: ser
          <entry><literal>TwophaseFileWrite</literal></entry>
          <entry>Waiting for a write of a two phase state file.</entry>
         </row>
+        <row>
+         <entry><literal>UndoCheckpointRead</literal></entry>
+         <entry>Waiting for a read from an undo checkpoint file.</entry>
+        </row>
+        <row>
+         <entry><literal>UndoCheckpointSync</literal></entry>
+         <entry>Waiting for changes to an undo checkpoint file to reach stable storage.</entry>
+        </row>
+        <row>
+         <entry><literal>UndoCheckpointWrite</literal></entry>
+         <entry>Waiting for a write to an undo checkpoint file.</entry>
+        </row>
+         <row>
+         <entry><literal>UndoFileRead</literal></entry>
+         <entry>Waiting for a read from an undo data file.</entry>
+        </row>
+        <row>
+         <entry><literal>UndoFileSync</literal></entry>
+         <entry>Waiting for changes to an undo data file to reach stable storage.</entry>
+        </row>
+        <row>
+         <entry><literal>UndoFileWrite</literal></entry>
+         <entry>Waiting for a write to an undo data file.</entry>
+        </row>
         <row>
          <entry><literal>WALBootstrapSync</literal></entry>
          <entry>Waiting for WAL to reach stable storage during bootstrapping.</entry>
@@ -1710,6 +1741,80 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
 </programlisting>
    </para>
 
+  <table id="pg-stat-undo-logs-view" xreflabel="pg_stat_undo_logs">
+   <title><structname>pg_stat_undo_logs</structname> View</title>
+
+   <tgroup cols="3">
+    <thead>
+    <row>
+      <entry>Column</entry>
+      <entry>Type</entry>
+      <entry>Description</entry>
+     </row>
+    </thead>
+
+   <tbody>
+    <row>
+     <entry><structfield>log_number</structfield></entry>
+     <entry><type>oid</type></entry>
+     <entry>Identifier of this undo log</entry>
+    </row>
+    <row>
+     <entry><structfield>persistence</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Persistence level of data stored in this undo log; one of
+      <literal>permanent</literal>, <literal>unlogged</literal> or
+      <literal>temporary</literal>.</entry>
+    </row>
+    <row>
+     <entry><structfield>tablespace</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Tablespace that holds physical storage of this undo log.</entry>
+    </row>
+    <row>
+     <entry><structfield>discard</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Location of the oldest data in this undo log.</entry>
+    </row>
+    <row>
+     <entry><structfield>insert</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Location where the next data will be written in this undo
+      log.</entry>
+    </row>
+    <row>
+     <entry><structfield>end</structfield></entry>
+     <entry><type>text</type></entry>
+     <entry>Location one byte past the end of the allocated physical storage
+      backing this undo log.</entry>
+    </row>
+    <row>
+     <entry><structfield>xid</structfield></entry>
+     <entry><type>xid</type></entry>
+     <entry>Transaction currently attached to this undo log
+      for writing.</entry>
+    </row>
+    <row>
+     <entry><structfield>pid</structfield></entry>
+     <entry><type>integer</type></entry>
+     <entry>Process ID of the backend currently attached to this undo log
+      for writing.</entry>
+    </row>
+   </tbody>
+   </tgroup>
+  </table>
+
+  <para>
+   The <structname>pg_stat_undo_logs</structname> view will have one row for
+   each undo log that exists.  Undo logs are extents within a contiguous
+   addressing space that have their own head and tail pointers.
+   Each backend that has written undo data is associated with one or more
+   undo logs, and is the only backend that is allowed to write data to those
+   undo logs.  Backends can be associated with up to three undo logs at a time,
+   because different undo logs are used for the undo data associated with
+   permanent, unlogged and temporary relations.
+  </para>
+ 
   <table id="pg-stat-replication-view" xreflabel="pg_stat_replication">
    <title><structname>pg_stat_replication</structname> View</title>
    <tgroup cols="3">
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 8ef2ac80106..a693182f72e 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -141,6 +141,11 @@ Item
  <entry>Subdirectory containing state files for prepared transactions</entry>
 </row>
 
+<row>
+ <entry><filename>pg_undo</filename></entry>
+ <entry>Subdirectory containing undo log meta-data files</entry>
+</row>
+
 <row>
  <entry><filename>pg_wal</filename></entry>
  <entry>Subdirectory containing WAL (Write Ahead Log) files</entry>
@@ -686,6 +691,57 @@ erased (they will be recreated automatically as needed).
 
 </sect1>
 
+<sect1 id="undo-logs">
+
+<title>Undo Logs</title>
+
+<indexterm>
+ <primary>Undo Logs</primary>
+</indexterm>
+
+<para>
+Undo logs hold data that is used for rolling back and for implementing
+MVCC in access managers that are undo-aware (currently "zheap").  The storage
+format of undo logs is optimized for reusing existing files.
+</para>
+
+<para>
+Undo data exists in a 64 bit address space broken up into numbered undo logs
+that represent 1TB extents, for efficient management.  The space is further
+broken up into 1MB segment files, for physical storage.  The name of each file
+is the address of the first byte in the file, with a period inserted after
+the part that indicates the undo log number.
+</para>
+
+<para>
+Each undo log is created in a particular tablespace and stores data for a
+particular persistence level.
+Undo logs are global in the sense that they don't belong to any particular
+database and may contain undo data from relations in any database.
+Undo files backing undo logs in the default tablespace are stored under
+<varname>PGDATA</varname><filename>/base/undo</filename>, and for other
+tablespaces under <filename>undo</filename> in the appropriate tablespace
+directory.  The system view <xref linkend="pg-stat-undo-logs-view"/> can be
+used to see the cluster's current list of undo logs along with their
+tablespaces and persistence levels.
+</para>
+
+<para>
+Just as relations can have one of the three persistence levels permanent,
+unlogged or temporary, the undo data that is generated by modifying them must
+be stored in an undo log of the same persistence level.  This enables the
+undo data to be discarded at appropriate times along with the relations that
+reference it.
+</para>
+
+<para>
+Undo log files contain standard page headers as described in the next section,
+but the format of the rest of the page is determined by the undo-aware
+access method that reads and writes it.
+</para>
+
+</sect1>
+
 <sect1 id="storage-page-layout">
 
 <title>Database Page Layout</title>
-- 
2.17.0

#2James Sewell
james.sewell@jirotech.com
In reply to: Thomas Munro (#1)
Re: Undo logs

Exciting stuff! Really looking forward to having a play with this.

James Sewell,
*Chief Architect*

Suite 112, Jones Bay Wharf, 26-32 Pirrama Road, Pyrmont NSW 2009
*P* (+61) 2 8099 9000  *W* www.jirotech.com  *F* (+61) 2 8099 9099


#3Simon Riggs
simon@2ndquadrant.com
In reply to: Thomas Munro (#1)
Re: Undo logs

On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

As announced elsewhere[1][2][3], at EnterpriseDB we are working on a
proposal to add in-place updates with undo logs to PostgreSQL. The
goal is to improve performance and resource usage by recycling space
better.

Cool

The lowest level piece of this work is a physical undo log manager,

1. Efficient appending of new undo data from many concurrent
backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3. Efficient buffered random reading of undo data. Like relations.

Like an SLRU?

[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

I think there are quite a few design decisions there that need to be
discussed, so let's crack on and discuss them, please.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Simon Riggs (#3)
Re: Undo logs

Hi Simon,

On Mon, May 28, 2018 at 11:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

The lowest level piece of this work is a physical undo log manager,

1. Efficient appending of new undo data from many concurrent
backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3. Efficient buffered random reading of undo data. Like relations.

Like an SLRU?

Yes, but with some differences:

1. There is a variable number of undo logs. Each one corresponds to
a range of the 64 bit address space (see the sketch after this list),
and has its own head and tail pointers, so that concurrent writers
don't contend for buffers when appending data. (Unlike SLRUs, which
are statically defined, one for clog.c, one for commit_ts.c, ...).
2. Undo logs use regular buffers instead of having their own mini
buffer pool, ad hoc search and reclamation algorithm etc.
3. Undo logs support temporary, unlogged and permanent storage (=
local buffers and reset-on-crash-restart, for undo data relating to
relations of those persistence levels).
4. Undo log storage files are preallocated (rather than being
extended block by block), and the oldest file is renamed to become the
newest file in common cases, like WAL.
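
To make point 1 concrete, here is a minimal standalone sketch of how such a
64 bit undo record pointer can be read.  The names, and the exact 24/40 bit
split, are my own illustration inferred from the 1TB-per-log extents and the
hex values in the test output, not the macros the patch actually uses in
access/undolog.h:

/*
 * Illustration only: decompose a 64 bit undo record pointer into an undo
 * log number (high bits) and a byte offset within that log (low 40 bits,
 * since each undo log covers a 1TB extent).  All names here are made up
 * for this sketch.
 */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t DemoUndoRecPtr;

#define DEMO_UNDO_OFFSET_BITS	40				/* 2^40 bytes = 1TB per log */
#define DEMO_UNDO_SEGMENT_SIZE	(1024 * 1024)	/* 1MB physical segment files */

static unsigned
demo_log_number(DemoUndoRecPtr ptr)
{
	return (unsigned) (ptr >> DEMO_UNDO_OFFSET_BITS);
}

static uint64_t
demo_log_offset(DemoUndoRecPtr ptr)
{
	return ptr & ((UINT64_C(1) << DEMO_UNDO_OFFSET_BITS) - 1);
}

int
main(void)
{
	/* One of the pointers from the test output above. */
	DemoUndoRecPtr ptr = UINT64_C(0x0000010000000018);

	printf("log %u, offset %llu (segment %llu)\n",
		   demo_log_number(ptr),
		   (unsigned long long) demo_log_offset(ptr),
		   (unsigned long long) (demo_log_offset(ptr) / DEMO_UNDO_SEGMENT_SIZE));
	return 0;
}

Read this way, the hex values shown in pg_stat_undo_logs line up with the
file naming described in the documentation patch, where each 1MB segment
file is named after the address of its first byte.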

[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

I think there are quite a few design decisions there that need to be
discussed, so let's crack on and discuss them, please.

What do you think about using the main buffer pool?

Best case: pgbench type workload, discard pointer following closely
behind insert pointer, we never write anything out to disk (except for
checkpoints when we write a few pages), never advance the buffer pool
clock hand, and we use and constantly recycle 1-2 pages per connection
via the free list (as can be seen by monitoring insert - discard in
the pg_stat_undo_logs view).

Worst case: someone opens a snapshot and goes out to lunch so we can't
discard old undo data, and then we start to compete with other stuff
for buffers, and we hope the buffer reclamation algorithm is good at
its job (or can be improved).
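
That best case is essentially what undo_foreground_discard_test() in the
attached test module exercises.  Here is its loop body condensed into a
sketch (backend-side code, so it only compiles against the patched tree;
the demo_allocate_and_discard wrapper and the elided middle step are mine,
not part of the patch):

#include "postgres.h"

#include "access/transam.h"
#include "access/undolog.h"

/*
 * Sketch: allocate some undo space, advance the insert pointer past it,
 * then discard it straight away, as a pgbench-like workload effectively
 * would once the data is no longer needed.
 */
static void
demo_allocate_and_discard(size_t size, UndoPersistence persistence)
{
	/* Reserve space, attaching to (or creating) an undo log on demand. */
	UndoRecPtr	undo_ptr = UndoLogAllocate(size, persistence, NULL);

	/* ... a real caller writes its undo data into shared buffers here ... */

	/* Advance the insert pointer past the space we used. */
	UndoLogAdvance(undo_ptr, size, persistence);

	/*
	 * Once nothing can need the data any more, discard up to the byte just
	 * after it, so the same few pages keep being recycled off the free list.
	 */
	UndoLogDiscard(undo_ptr + size, InvalidTransactionId);
}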

I just talked about this proposal at a pgcon unconference session.
Here's some of the feedback I got:

1. Jeff Davis pointed out that I'm probably wrong about not needing
FPI, and there must at least be checksum problems with torn pages. He
also gave me an idea on how to fix that very cheaply, and I'm still
processing that feedback.
2. Andres Freund thought it seemed OK if we have smgr.c routing to
md.c for relations and undofile.c for undo, but if we're going to
generalise this technique to put other things into shared buffers
eventually too (like the SLRUs, as proposed by Shawn Debnath in
another unconf session) then it might be worth investigating how to
get md.c to handle all of their needs. They'd all just use fd.c
files, after all, so it'd be weird if we had to maintain several
different similar things.
3. Andres also suggested that high frequency free page list access
might be quite contended in the "best case" described above. I'll look
into that.
4. Someone said that segment sizes probably shouldn't be hard coded
(cf WAL experience).

I also learned in other sessions that there are other access managers
in development that need undo logs. I'm hoping to find out more about
that.

--
Thomas Munro
http://www.enterprisedb.com

#5Dilip Kumar
dilipbalaut@gmail.com
In reply to: Thomas Munro (#4)
2 attachment(s)
Re: Undo logs

Hello hackers,

As Thomas has already mentioned upthread, we are working on undo-log
based storage, and he has posted the patch sets for the lowest layer,
called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and
provides an interface to prepare, insert, and fetch undo records. This
layer uses undo-log-storage to reserve space for the undo records and
the buffer management routines to write and read them.

To prepare an undo record, the layer first allocates the required space
using the undo-log-storage module. Next, it pins and locks the required
buffers and returns an undo record pointer at which the record will be
inserted. Finally, the caller invokes the insert routine for the final
insertion of the prepared record. Additionally, there is a multi-insert
mechanism, wherein multiple records are prepared and inserted at a time.

To fetch an undo record, a caller must provide a valid undo record
pointer. Optionally, the caller can provide a callback function along
with the block and offset information, which allows faster retrieval of
the undo record; otherwise, the undo chain has to be traversed.
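
To make that sequence easier to follow, here is a small self-contained
sketch of the call order just described.  Every name and signature below
is a placeholder invented for illustration (the real functions are the
ones in undo_interface_v1.patch), so treat this as a trace of the intended
flow rather than the actual API:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t DemoUndoRecPtr;

typedef struct DemoUndoRecord
{
	uint32_t	xid;			/* transaction that generated the undo */
	const char *payload;		/* whatever is needed to undo the change */
} DemoUndoRecord;

/* Reserve space, pin and lock buffers, and report where the record will go. */
static DemoUndoRecPtr
DemoPrepareUndoInsert(const DemoUndoRecord *rec)
{
	printf("prepare: reserve space and lock buffers for xid %u\n",
		   (unsigned) rec->xid);
	return UINT64_C(0x0000000000000018);	/* made-up location */
}

/* Copy every prepared record into the buffers locked above. */
static void
DemoInsertPreparedUndo(void)
{
	printf("insert: copy the prepared record(s) into the buffers\n");
}

/* Read back the record 'ptr' points to, walking the undo chain if needed. */
static void
DemoFetchUndoRecord(DemoUndoRecPtr ptr)
{
	printf("fetch: read the record at %016llx\n", (unsigned long long) ptr);
}

int
main(void)
{
	DemoUndoRecord rec = { 100, "old tuple version" };
	DemoUndoRecPtr where = DemoPrepareUndoInsert(&rec);

	DemoInsertPreparedUndo();
	DemoFetchUndoRecord(where);
	return 0;
}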

These patch sets will apply on top of the undo-log-storage branch [1],
commit id fa3803a048955c4961581e8757fe7263a98fe6e6.

[1]: https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/

undo_interface_v1.patch is the main patch for providing the undo interface.
undo_interface_test_v1.patch is a simple test module to test the undo
interface layer.

On Thu, May 31, 2018 at 4:27 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

Hi Simon,

On Mon, May 28, 2018 at 11:40 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 24 May 2018 at 23:22, Thomas Munro <thomas.munro@enterprisedb.com> wrote:

The lowest level piece of this work is a physical undo log manager,

1. Efficient appending of new undo data from many concurrent
backends. Like logs.
2. Efficient discarding of old undo data that isn't needed anymore.
Like queues.
3. Efficient buffered random reading of undo data. Like relations.

Like an SLRU?

Yes, but with some differences:

1. There is a variable number of undo logs. Each one corresponds to
a range of the 64 bit address space, and has its own head and tail
pointers, so that concurrent writers don't contend for buffers when
appending data. (Unlike SLRUs which are statically defined, one for
clog.c, one for commit_ts.c, ...).
2. Undo logs use regular buffers instead of having their own mini
buffer pool, ad hoc search and reclamation algorithm etc.
3. Undo logs support temporary, unlogged and permanent storage (=
local buffers and reset-on-crash-restart, for undo data relating to
relations of those persistence levels).
4. Undo logs storage files are preallocated (rather than being
extended block by block), and the oldest file is renamed to become the
newest file in common cases, like WAL.

[4] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/access/undo
[5] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/src/backend/storage/smgr

I think there are quite a few design decisions there that need to be
discussed, so lets crack on and discuss them please.

What do you think about using the main buffer pool?

Best case: pgbench type workload, discard pointer following closely
behind insert pointer, we never write anything out to disk (except for
checkpoints when we write a few pages), never advance the buffer pool
clock hand, and we use and constantly recycle 1-2 pages per connection
via the free list (as can be seen by monitoring insert - discard in
the pg_stat_undo_logs view).

Worst case: someone opens a snapshot and goes out to lunch so we can't
discard old undo data, and then we start to compete with other stuff
for buffers, and we hope the buffer reclamation algorithm is good at
its job (or can be improved).

I just talked about this proposal at a pgcon unconference session.
Here's some of the feedback I got:

1. Jeff Davis pointed out that I'm probably wrong about not needing
FPI, and there must at least be checksum problems with torn pages. He
also gave me an idea on how to fix that very cheaply, and I'm still
processing that feedback.
2. Andres Freund thought it seemed OK if we have smgr.c routing to
md.c for relations and undofile.c for undo, but if we're going to
generalise this technique to put other things into shared buffers
eventually too (like the SLRUs, as proposed by Shawn Debnath in
another unconf session) then it might be worth investigating how to
get md.c to handle all of their needs. They'd all just use fd.c
files, after all, so it'd be weird if we had to maintain several
different similar things.
3. Andres also suggested that high frequency free page list access
might be quite contended in the "best case" described above. I'll look
into that.
4. Someone said that segment sizes probably shouldn't be hard coded
(cf WAL experience).

I also learned in other sessions that there are other access managers
in development that need undo logs. I'm hoping to find out more about
that.

--
Thomas Munro
http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

undo_interface_v1.patchapplication/octet-stream; name=undo_interface_v1.patchDownload
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd8270d5fb..5c33bfe736 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -188,6 +188,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -219,6 +223,9 @@ static TransactionStateData TopTransactionStateData = {
 	false,						/* startedInRecovery */
 	false,						/* didLogXid */
 	0,							/* parallelModeLevel */
+	/* start and end undo record locations for each persistence level */
+	{InvalidUndoRecPtr,InvalidUndoRecPtr,InvalidUndoRecPtr},
+	{InvalidUndoRecPtr,InvalidUndoRecPtr,InvalidUndoRecPtr},
 	NULL						/* link to parent state block */
 };
 
@@ -907,6 +914,24 @@ IsInParallelMode(void)
 	return CurrentTransactionState->parallelModeLevel != 0;
 }
 
+/*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr));
+	UndoPersistence upersistence = log->meta.persistence;
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
 /*
  *	CommandCounterIncrement
  */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 493f1db7b9..cc6b638e6c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8481,6 +8481,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 	*epoch = ckptXidEpoch;
 }
 
+/*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
 /*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c6963cf..f41e8f7e5c 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000000..08e8a155a0
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Handling multilog -
+ *  A single transaction's undo records can be spread across multiple undo
+ *  logs, and we need some special handling while inserting the undo so that
+ *  discard and rollback work sanely.
+ *
+ *  If an undo record goes to the next log, we insert a transaction header
+ *  for the first record in the new log and update the transaction header in
+ *  the previous log with the new log's location.  This allows us to connect
+ *  the parts of a transaction that span multiple logs (for this we keep
+ *  track of the previous logno in the undo log metadata), which is required
+ *  to find the latest undo record pointer of an aborted transaction when
+ *  executing the undo actions before discard.  If the next log gets
+ *  processed first, we don't need to trace back the actual start pointer of
+ *  the transaction; in that case we can execute the undo actions from the
+ *  current log only, because the undo pointer in the slot will be rewound
+ *  and that is enough to avoid executing the same actions again.  However,
+ *  it is possible that after the undo actions are executed the undo pointer
+ *  gets discarded; later, while processing the previous log, we might then
+ *  try to fetch an undo record in the discarded log while chasing the
+ *  transaction header chain.  To avoid this situation we first check whether
+ *  the next_urec of the transaction is already discarded; if so, there is no
+ *  need to access it and we start executing from the last undo record in the
+ *  current log.
+ *
+ *  We only connect to the next log if the same transaction spreads to it.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * FIXME:  Do we want to support an undo tuple size that is more than BLCKSZ?
+ * If not, an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to support more than the default number of prepared undo
+ * records, the limit can be increased by calling UndoSetPrepareSize.  In that
+ * case dynamic memory is allocated, and prepared_undo and undo_buffer start
+ * pointing to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, at which point these variables are set back to
+ * their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record.*/
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool IsPrevTxnUndoDiscarded(UndoLogControl *log,
+								   UndoRecPtr prev_xact_urp);
+
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is not yet initialized.  We have to check
+		 * UndoLogIsDiscarded and if it's already discarded then we have
+		 * nothing to do.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return true;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno);
+	}
+
+	/*
+	 * TODO: For now we don't know how to build a transaction chain for
+	 * temporary undo logs.  That's because this log might have been used by a
+	 * different backend, and we can't access its buffers.  What should happen
+	 * is that the undo data should be automatically discarded when the other
+	 * backend detaches, but that code doesn't exist yet and the undo worker
+	 * can't do it either.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * If the previous transaction's urp is not valid, it means this backend
+	 * is preparing its first undo, so fetch the information from the undo
+	 * log.  If the urp is still invalid, this is the first undo record for
+	 * this log and we have nothing to update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in the process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (IsPrevTxnUndoDiscarded(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * PrepareUndoRecordUpdateTransInfo.  This must be called under the critical
+ * section.  This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in the process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (IsPrevTxnUndoDiscarded(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present then just
+ * return its index; otherwise read the buffer, insert an entry into the array,
+ * and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		if (blk == undo_buffer[i].blk)
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate space for them
+ * in bulk.  This is required for operations that can allocate multiple undo
+ * records in one WAL operation, e.g. multi-insert.  If we don't allocate undo
+ * space for all the records (which are inserted under one WAL record)
+ * together, then there is a possibility that they end up in different undo
+ * logs.  And, currently, during recovery we don't have a mechanism to map an
+ * xid to multiple log numbers for one WAL operation.  So, in short, all the
+ * operations under one WAL record must allocate their undo from the same log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid,
+						xl_undolog_meta *undometa)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set the
+	 * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+	 * the space for the transaction header and the valid value of the uur_next
+	 * will be updated while preparing the first undo record of the next
+	 * transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+		}
+		else
+			urec->uur_next = InvalidUndoRecPtr;
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr));
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log and not the first record of
+	 * the transaction i.e. same transaction continued from the previous log
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If transaction id is switched then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			PrepareUndoRecordUpdateTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	/* Copy undometa before advancing the insert location. */
+	if (undometa)
+	{
+		undometa->meta = log->meta;
+		undometa->logno = log->logno;
+		undometa->xid = log->xid;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If size is > MAX_PREPARED_UNDO then it
+ * will allocate extra memory to hold the extra prepared undo records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence,
+				   xl_undolog_meta *undometa)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid, undometa);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's
+	 * starting undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id, because
+ * the undo log only stores the mapping for topmost transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid, xl_undolog_meta *undometa)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX there is also an option: instead of adding the information to
+	 * this record we could prepare a new record which only contains
+	 * transaction information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* we expect during recovery, we always have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores the
+		 * mapping for topmost transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid, undometa);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* Undo record can not fit into this block so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp));
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page. If it doesn't
+			 * succeed then recall the routine with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepare_undo limit was changed, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  This function does not release the pin on
+ * the buffer if the complete record is fetched from one buffer, so the caller
+ * can reuse the same urec to fetch another undo record that is on the same
+ * block.  The caller is responsible for releasing the buffer inside urec and
+ * setting it to invalid if it wishes to fetch a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to just fetch header first and only
+		 * if matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and also set an invalid buffer in the undo record.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to release the
+ * resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned then check whether the next tuple
+		 * we want is from the same block.  Otherwise release the buffer and
+		 * set it invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is not a valid buffer in urec->uur_buffer that means we
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * have copied the payload data and tuple data, so free them.
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno);
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized. Hence, we've to check
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000000..fe5a8d7f1e
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,452 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX There is a possible optimization: instead of always allocating
+		 * memory when the record is split, we could check whether the payload
+		 * or tuple data falls entirely within one page and, if so, avoid
+		 * allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number of
+ * bytes to be read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * 'nocopy': if this flag is set to true then we just skip over readlen bytes
+ * of undo data without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	/* Return true only if we read the whole thing. */
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000000..7f8f96614d
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+									TransactionId, xl_undolog_meta *);
+
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for given blkno and offset.  Start the search
+ * from urp.  The caller needs to call UndoRecordRelease to release the resources
+ * allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the size is > MAX_PREPARED_UNDO,
+ * extra memory will be allocated to hold the extra prepared undo records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence,
+							   xl_undolog_meta *undometa);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000000..4dfa4f2652
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_xidepoch; /* epoch of the current transaction */
+	uint64			urec_next;	/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 083e879d5c..1ffa3b3be0 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -429,5 +430,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 421ba6d775..ea791d58ce 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
undo_interface_test_v1.patchapplication/octet-stream; name=undo_interface_test_v1.patchDownload
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 43323a6..e05fd00 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_undo \
+		  test_undo_api \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/test_undo_api/Makefile b/src/test/modules/test_undo_api/Makefile
new file mode 100644
index 0000000..deb3816
--- /dev/null
+++ b/src/test/modules/test_undo_api/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_undo_api/Makefile
+
+MODULE_big = test_undo_api
+OBJS = test_undo_api.o
+PGFILEDESC = "test_undo_api - a test module for the undo api layer"
+
+EXTENSION = test_undo_api
+DATA = test_undo_api--1.0.sql
+
+REGRESS = test_undo_api
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_undo_api
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_undo_api/expected/test_undo_api.out b/src/test/modules/test_undo_api/expected/test_undo_api.out
new file mode 100644
index 0000000..995b517
--- /dev/null
+++ b/src/test/modules/test_undo_api/expected/test_undo_api.out
@@ -0,0 +1,12 @@
+CREATE EXTENSION test_undo_api;
+--
+-- This test inserts data into the undo log using the undo API, and then
+-- fetches the data back and verifies that we got the same data that we
+-- inserted.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
+ test_undo_api 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_undo_api/sql/test_undo_api.sql b/src/test/modules/test_undo_api/sql/test_undo_api.sql
new file mode 100644
index 0000000..4fb40ff
--- /dev/null
+++ b/src/test/modules/test_undo_api/sql/test_undo_api.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION test_undo_api;
+
+--
+-- This test inserts data into the undo log using the undo API, and then
+-- fetches the data back and verifies that we got the same data that we
+-- inserted.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
diff --git a/src/test/modules/test_undo_api/test_undo_api--1.0.sql b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
new file mode 100644
index 0000000..3dd134b
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
@@ -0,0 +1,8 @@
+\echo Use "CREATE EXTENSION test_undo_api" to load this file. \quit
+
+CREATE FUNCTION test_undo_api(xid xid, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_undo_api/test_undo_api.c b/src/test/modules/test_undo_api/test_undo_api.c
new file mode 100644
index 0000000..6026582
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.c
@@ -0,0 +1,84 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_class.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_undo_api);
+
+static UndoPersistence
+undo_persistence_from_text(text *t)
+{
+	char *str = text_to_cstring(t);
+
+	if (strcmp(str, "permanent") == 0)
+		return UNDO_PERMANENT;
+	else if (strcmp(str, "temporary") == 0)
+		return UNDO_TEMP;
+	else if (strcmp(str, "unlogged") == 0)
+		return UNDO_UNLOGGED;
+	else
+		elog(ERROR, "unknown undo persistence level: %s", str);
+}
+
+/*
+ * Prepare and insert data in undo storage and fetch it back to verify.
+ */
+Datum
+test_undo_api(PG_FUNCTION_ARGS)
+{
+	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	char	*data = "test_data";
+	int		 len = strlen(data);
+	UnpackedUndoRecord	undorecord;
+	UnpackedUndoRecord *undorecord_out;
+	int	header_size = offsetof(UnpackedUndoRecord, uur_next) + sizeof(uint64);
+	UndoRecPtr	undo_ptr;
+
+	undorecord.uur_type = 0;
+	undorecord.uur_info = 0;
+	undorecord.uur_prevlen = 0;
+	undorecord.uur_prevxid = FrozenTransactionId;
+	undorecord.uur_xid = xid;
+	undorecord.uur_cid = 0;
+	undorecord.uur_tsid = 100;
+	undorecord.uur_fork = MAIN_FORKNUM;
+	undorecord.uur_blkprev = 0;
+	undorecord.uur_block = 1;
+	undorecord.uur_offset = 100;
+	initStringInfo(&undorecord.uur_tuple);
+	
+	appendBinaryStringInfo(&undorecord.uur_tuple,
+						   (char *) data,
+						   len);
+	undo_ptr = PrepareUndoInsert(&undorecord, persistence, xid, NULL);
+	InsertPreparedUndo();
+	UnlockReleaseUndoBuffers();
+	
+	undorecord_out = UndoFetchRecord(undo_ptr, InvalidBlockNumber,
+									 InvalidOffsetNumber,
+									 InvalidTransactionId, NULL,
+									 NULL);
+
+	if (strncmp((char *) &undorecord, (char *) undorecord_out, header_size) != 0)
+		elog(ERROR, "undo header did not match");
+	if (strncmp(undorecord_out->uur_tuple.data, data, len) != 0)
+		elog(ERROR, "undo data did not match");
+
+	UndoRecordRelease(undorecord_out);
+	pfree(undorecord.uur_tuple.data);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_undo_api/test_undo_api.control b/src/test/modules/test_undo_api/test_undo_api.control
new file mode 100644
index 0000000..09df344
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.control
@@ -0,0 +1,4 @@
+comment = 'test_undo_api'
+default_version = '1.0'
+module_pathname = '$libdir/test_undo_api'
+relocatable = true
#6Dilip Kumar
dilip.kumar@enterprisedb.com
In reply to: Dilip Kumar (#5)
Re: Undo logs

On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

Hello hackers,

As Thomas has already mentioned upthread, we are working on
undo-log-based storage, and he has posted the patch sets for the lowest
layer, called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and
provides an interface to prepare, insert, or fetch undo records. This
layer uses undo-log-storage to reserve space for the undo records and
the buffer management routines to write and read them.

To prepare an undo record, it first allocates the required space
using the undo-log-storage module. Next, it pins and locks the required
buffers and returns an undo record pointer at which the record will be
inserted. Finally, the insert routine is called to actually write the
prepared record. Additionally, there is a multi-insert mechanism,
wherein multiple records are prepared and inserted at a time.
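
For illustration, here is a minimal sketch of that prepare/insert flow,
condensed from the test_undo_api module in undo_interface_test_v1.patch.
The function name and the field values are made up for illustration, and
the critical-section placement just follows the comments in undoinsert.h;
this is not code from the patch itself:

#include "postgres.h"

#include "access/transam.h"
#include "access/undoinsert.h"
#include "access/xact.h"
#include "catalog/pg_tablespace.h"
#include "common/relpath.h"
#include "miscadmin.h"

/* Sketch only: prepare, insert and release one permanent undo record. */
static UndoRecPtr
insert_one_undo_record(char *data, int len)
{
	UnpackedUndoRecord rec = {0};
	UndoRecPtr	urp;

	rec.uur_prevxid = FrozenTransactionId;
	rec.uur_xid = GetTopTransactionId();
	rec.uur_tsid = DEFAULTTABLESPACE_OID;
	rec.uur_fork = MAIN_FORKNUM;
	rec.uur_block = 1;			/* placeholder block/offset values */
	rec.uur_offset = 1;
	initStringInfo(&rec.uur_tuple);
	appendBinaryStringInfo(&rec.uur_tuple, data, len);

	/*
	 * Reserve undo space and pin the needed buffers; this can fail, so it
	 * must happen before any critical section.
	 */
	urp = PrepareUndoInsert(&rec, UNDO_PERMANENT, rec.uur_xid, NULL);

	/* Write the prepared record into the pinned buffers. */
	START_CRIT_SECTION();
	InsertPreparedUndo();
	END_CRIT_SECTION();

	/* Unlock and unpin the undo buffers once outside the critical section. */
	UnlockReleaseUndoBuffers();

	pfree(rec.uur_tuple.data);
	return urp;
}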

To fetch an undo record, a caller must provide a valid undo record
pointer. Optionally, the caller can provide a callback function along
with the block number and offset, which helps retrieve the undo record
faster; otherwise, the whole undo chain has to be traversed.

These patch sets will apply on top of the undo-log-storage branch [1],
commit id fa3803a048955c4961581e8757fe7263a98fe6e6.

[1] https://github.com/EnterpriseDB/zheap/tree/undo-log-storage/

undo_interface_v1.patch is the main patch for providing the undo interface.
undo_interface_test_v1.patch is a simple test module to test the undo
interface layer.

Thanks to Robert Haas for designing an early prototype for forming
undo records. Later, I completed the remaining parts of the code,
including undo record prepare, insert, fetch and other related APIs,
with the help of Rafia Sabih. Thanks to Amit Kapila for providing valuable
design inputs.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#7Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Dilip Kumar (#6)
Re: Undo logs

On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
<dilip.kumar@enterprisedb.com> wrote:

On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

As Thomas has already mentioned upthread, we are working on
undo-log-based storage, and he has posted the patch sets for the lowest
layer, called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and
provides an interface to prepare, insert, or fetch undo records. This
layer uses undo-log-storage to reserve space for the undo records and
the buffer management routines to write and read them.

I have also pushed a new WIP version of the lower level undo log
storage layer patch set to a public branch[1]. I'll leave the earlier
branch[2] there because the record-level patch posted by Dilip depends
on it for now.

The changes are mostly internal: it doesn't use DSM segments any more.
Originally I wanted to use DSM because I didn't want arbitrary limits,
but in fact DSM slots can run out in unpredictable ways, and unlike
parallel query the undo log subsystem doesn't have a plan B for when
it can't get the space it needs due to concurrent queries. Instead,
this version uses a pool of size 4 * max_connections, fixed at startup
in regular shared memory. This creates an arbitrary limit on
transaction size, but it's a large one at 1TB per slot, can be increased,
doesn't disappear unpredictably, is easy to monitor
(pg_stat_undo_logs), and is probably a useful brake on a system in
trouble.
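
To give an idea of the arithmetic, here is a purely hypothetical sketch
of how such a fixed pool might be sized; this is not the code in the
branch, and UndoLogSlotPoolShmemSize is a made-up name:

#include "postgres.h"

#include "access/undolog.h"
#include "miscadmin.h"
#include "storage/shmem.h"

/*
 * Hypothetical sketch only: reserve a fixed pool of undo log control
 * slots in regular shared memory, sized from max_connections at startup.
 */
static Size
UndoLogSlotPoolShmemSize(void)
{
	return mul_size(4 * MaxConnections, sizeof(UndoLogControl));
}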

More soon.

[1]: https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2
[2]: https://github.com/EnterpriseDB/zheap/tree/undo-log-storage

--
Thomas Munro
http://www.enterprisedb.com

#8Dilip Kumar
dilipbalaut@gmail.com
In reply to: Thomas Munro (#7)
2 attachment(s)
Re: Undo logs

On Sun, Sep 2, 2018 at 12:18 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
<dilip.kumar@enterprisedb.com> wrote:

On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

As Thomas has already mentioned upthread, we are working on
undo-log-based storage, and he has posted the patch sets for the lowest
layer, called undo-log-storage.

This is the next layer, which sits on top of the undo log storage and
provides an interface to prepare, insert, or fetch undo records. This
layer uses undo-log-storage to reserve space for the undo records and
the buffer management routines to write and read them.

I have also pushed a new WIP version of the lower level undo log
storage layer patch set to a public branch[1]. I'll leave the earlier
branch[2] there because the record-level patch posted by Dilip depends
on it for now.

Rebased undo_interface patches on top of the new branch of undo-log-storage[1].

[1]: https://github.com/EnterpriseDB/zheap/tree/undo-log-storage-v2

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

undo_interface_v2.patchapplication/octet-stream; name=undo_interface_v2.patchDownload
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index cd8270d..ad28762 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -188,6 +188,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -219,6 +223,9 @@ static TransactionStateData TopTransactionStateData = {
 	false,						/* startedInRecovery */
 	false,						/* didLogXid */
 	0,							/* parallelModeLevel */
+	/* start and end undo record locations for each persistence level */
+	{InvalidUndoRecPtr,InvalidUndoRecPtr,InvalidUndoRecPtr},
+	{InvalidUndoRecPtr,InvalidUndoRecPtr,InvalidUndoRecPtr},
 	NULL						/* link to parent state block */
 };
 
@@ -908,6 +915,24 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 65db2e4..748e8bc 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8501,6 +8501,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..4d855e0
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Handling multiple logs -
+ *  It is possible that the undo records of a transaction are spread across
+ *  multiple undo logs.  We need some special handling while inserting such
+ *  undo so that discard and rollback work sanely.
+ *
+ *  If an undo record goes to the next log, then we insert a transaction
+ *  header for the first record in the new log and update the previous
+ *  transaction header with this new log's location.  This allows us to
+ *  connect transactions across logs when the same transaction spans multiple
+ *  logs (for this we keep track of the previous logno in the undo log meta
+ *  data), which is required to find the latest undo record pointer of an
+ *  aborted transaction when executing the undo actions before discard.  If
+ *  the next log gets processed first, we don't need to trace back the actual
+ *  start pointer of the transaction; in that case we can execute the undo
+ *  actions from the current log only, because the undo pointer in the slot
+ *  will be rewound, and that is enough to avoid executing the same actions
+ *  again.  However, it is possible that after executing the undo actions the
+ *  undo pointer gets discarded; at a later stage, while processing the
+ *  previous log, we might then try to fetch an undo record in the discarded
+ *  log while chasing the transaction header chain.  To avoid this we first
+ *  check whether the next_urec of the transaction is already discarded; if
+ *  so, there is no need to access it, and we start executing from the last
+ *  undo record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads to the
+ *  next log; otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * FIXME:  Do we want to support an undo tuple size which is more than BLCKSZ?
+ * If not, then an undo record can spread across at most 2 buffers.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo records that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.
+ * In case the caller wants to support more than the default number of
+ * prepared undo records, the limit can be increased by calling the
+ * UndoSetPrepareSize function.  Therein, dynamic memory will be allocated,
+ * and prepared_undo and undo_buffer will start pointing to the newly
+ * allocated memory, which will be released by UnlockReleaseUndoBuffers,
+ * and these variables will again be set back to their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record.*/
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool IsPrevTxnUndoDiscarded(UndoLogControl *log,
+								   UndoRecPtr prev_xact_urp);
+
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is not yet initialized.  We have to check
+		 * UndoLogIsDiscarded and if it's already discarded then we have
+		 * nothing to do.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return true;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * TODO: For now we don't know how to build a transaction chain for
+	 * temporary undo logs.  That's because this log might have been used by a
+	 * different backend, and we can't access its buffers.  What should happen
+	 * is that the undo data should be automatically discarded when the other
+	 * backend detaches, but that code doesn't exist yet and the undo worker
+	 * can't do it either.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * If the previous transaction's urp is not valid, it means this backend is
+	 * preparing its first undo, so fetch the information from the undo log.
+	 * If the urp is still invalid, this is the first undo record for this
+	 * log and we have nothing to update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (IsPrevTxnUndoDiscarded(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * PrepareUndoRecordUpdateTransInfo.  This must be called under the critical
+ * section.  This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (IsPrevTxnUndoDiscarded(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transactions start urecptr in the transaction
+	 * Update the next transaction's start urecptr in the transaction
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present, just
+ * return its index; otherwise read the buffer, insert an entry into the
+ * array, and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		if (blk == undo_buffer[i].blk)
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate the space in
+ * bulk.  This is required for operations which can allocate multiple undo
+ * records in one WAL operation, e.g. multi-insert.  If we don't allocate undo
+ * space for all the records (which are inserted under one WAL record)
+ * together, there is a possibility that they end up in different undo logs.
+ * And, currently, during recovery we don't have a mechanism to map an xid to
+ * multiple log numbers for one WAL operation.  So, in short, all the
+ * operations under one WAL record must allocate their undo from the same
+ * undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid,
+						xl_undolog_meta *undometa)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set the
+	 * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+	 * the space for the transaction header and the valid value of the uur_next
+	 * will be updated while preparing the first undo record of the next
+	 * transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+		}
+		else
+			urec->uur_next = InvalidUndoRecPtr;
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log and not the first record of
+	 * the transaction i.e. same transaction continued from the previous log
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If transaction id is switched then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			PrepareUndoRecordUpdateTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	/* Copy undometa before advancing the insert location. */
+	if (undometa)
+	{
+		undometa->meta = log->meta;
+		undometa->logno = log->logno;
+		undometa->xid = log->xid;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the size is > MAX_PREPARED_UNDO,
+ * extra memory will be allocated to hold the extra prepared undo records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence,
+				   xl_undolog_meta *undometa)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid, undometa);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's
+	 * starting undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid, xl_undolog_meta *undometa)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX there is also an option that instead of adding the information to
+	 * this record we could prepare a new record which only contains the
+	 * transaction information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* we expect during recovery, we always have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because undo log only stores mapping for
+		 * the top most transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid, undometa);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* The undo record cannot fit into this block, so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in the page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page. If it doesn't
+			 * succeed, then call it again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block, then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * The max_prepare_undo limit was changed, so free the allocated memory and
+	 * reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed
+ * to by urp and unpacks the record into urec.  This function does not release
+ * the pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record which is on
+ * the same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from another
+ * block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch just the header first, and
+		 * only if it matches the block number and offset, fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and also set the buffer in the undo record to invalid.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which started before transaction 500 and wants to traverse the
+ * chain to find the visible tuple will keep rotating infinitely between the
+ * undo tuples written by 502 and 501.  In such a case, we need to skip the
+ * undo tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of the undo tuple written by transaction
+ * 501.  Start the search from urp.  The caller needs to call
+ * UndoRecordRelease to release the resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we already have a valid buffer pinned, check whether the next
+		 * record we want is in the same block.  If not, release the buffer
+		 * and mark it invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * UndoLogIsDiscarded, and if the undo is already discarded then we
+			 * have nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undo record satisfies the conditions. */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, false);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* Calculate the previous undo record pointer. */
+	return MakeUndoRecPtr(logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..fe5a8d7
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,452 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_split = (my_bytes_decoded > 0);
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page, just point
+		 * the payload data and tuple data into the page; otherwise allocate
+		 * memory for them.
+		 *
+		 * XXX A possible optimization: instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload or
+		 * tuple data still falls entirely within this page and, if so, avoid
+		 * allocating memory for that part.
+		 */
+		if (!is_undo_split &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to be read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * 'nocopy': if this flag is true, skip over 'readlen' bytes of undo data
+ * without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	/* Return true only if we read the whole thing. */
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..7f8f966
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,99 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+									TransactionId, xl_undolog_meta *);
+
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting
+ * any critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for the given blkno and offset.  Start the
+ * search from urp.  The caller needs to call UndoRecordRelease to release
+ * the resources allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the size is > MAX_PREPARED_UNDO,
+ * extra memory is allocated to hold the additional prepared records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence,
+							   xl_undolog_meta *undometa);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..4dfa4f2
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without alignment padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_xidepoch; /* epoch of the current transaction */
+	uint64			urec_next;	/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, the caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 083e879..1ffa3b3 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -429,5 +430,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 421ba6d..ea791d5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
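The comments in undoinsert.h above describe a three-step calling protocol:
prepare before any critical section, insert inside one, then unlock and
release afterwards.  As a minimal illustration only -- not part of the patch,
and assuming just the signatures declared above -- a caller might use it as
sketched below; the test module in the next attachment exercises the same
sequence end-to-end.

#include "access/transam.h"
#include "access/undoinsert.h"
#include "access/xact.h"
#include "catalog/pg_tablespace.h"
#include "common/relpath.h"
#include "miscadmin.h"

/* Illustrative sketch: write one undo record describing a change to
 * (blkno, offnum), then read it back.  Error handling, the WAL record for
 * the data change itself, and all zheap details are omitted. */
static void
undo_insert_and_fetch_example(BlockNumber blkno, OffsetNumber offnum)
{
	UnpackedUndoRecord rec = {0};
	UnpackedUndoRecord *fetched;
	UndoRecPtr	urp;

	rec.uur_xid = GetTopTransactionId();
	rec.uur_prevxid = FrozenTransactionId;
	rec.uur_tsid = DEFAULTTABLESPACE_OID;
	rec.uur_fork = MAIN_FORKNUM;
	rec.uur_block = blkno;
	rec.uur_offset = offnum;

	/* Step 1: reserve undo space and pin the buffers; this can fail, so it
	 * happens before any critical section is established. */
	urp = PrepareUndoInsert(&rec, UNDO_PERMANENT, rec.uur_xid, NULL);

	START_CRIT_SECTION();
	/* Step 2: write the prepared record into the pinned buffers. */
	InsertPreparedUndo();
	/* ... the WAL record describing the data change would be emitted here ... */
	END_CRIT_SECTION();

	/* Step 3: unlock and unpin the undo buffers. */
	UnlockReleaseUndoBuffers();

	/* A reader can later fetch the record by its pointer and must release
	 * the resources afterwards. */
	fetched = UndoFetchRecord(urp, InvalidBlockNumber, InvalidOffsetNumber,
							  InvalidTransactionId, NULL, NULL);
	if (fetched != NULL)
		UndoRecordRelease(fetched);
}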
undo_interface_test_v2.patchapplication/octet-stream; name=undo_interface_test_v2.patchDownload
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 43323a6..e05fd00 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_undo \
+		  test_undo_api \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/test_undo_api/Makefile b/src/test/modules/test_undo_api/Makefile
new file mode 100644
index 0000000..deb3816
--- /dev/null
+++ b/src/test/modules/test_undo_api/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_undo_api/Makefile
+
+MODULE_big = test_undo_api
+OBJS = test_undo_api.o
+PGFILEDESC = "test_undo_api - a test module for the undo api layer"
+
+EXTENSION = test_undo_api
+DATA = test_undo_api--1.0.sql
+
+REGRESS = test_undo_api
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_undo_api
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_undo_api/expected/test_undo_api.out b/src/test/modules/test_undo_api/expected/test_undo_api.out
new file mode 100644
index 0000000..995b517
--- /dev/null
+++ b/src/test/modules/test_undo_api/expected/test_undo_api.out
@@ -0,0 +1,12 @@
+CREATE EXTENSION test_undo_api;
+--
+-- This test inserts data into the undo log using the undo API, and then
+-- fetches the data back and verifies that it matches what was originally
+-- inserted.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
+ test_undo_api 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_undo_api/sql/test_undo_api.sql b/src/test/modules/test_undo_api/sql/test_undo_api.sql
new file mode 100644
index 0000000..4fb40ff
--- /dev/null
+++ b/src/test/modules/test_undo_api/sql/test_undo_api.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION test_undo_api;
+
+--
+-- This test inserts data into the undo log using the undo API, and then
+-- fetches the data back and verifies that it matches what was originally
+-- inserted.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
diff --git a/src/test/modules/test_undo_api/test_undo_api--1.0.sql b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
new file mode 100644
index 0000000..3dd134b
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
@@ -0,0 +1,8 @@
+\echo Use "CREATE EXTENSION test_undo_api" to load this file. \quit
+
+CREATE FUNCTION test_undo_api(xid xid, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_undo_api/test_undo_api.c b/src/test/modules/test_undo_api/test_undo_api.c
new file mode 100644
index 0000000..6026582
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.c
@@ -0,0 +1,84 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_class.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_undo_api);
+
+static UndoPersistence
+undo_persistence_from_text(text *t)
+{
+	char *str = text_to_cstring(t);
+
+	if (strcmp(str, "permanent") == 0)
+		return UNDO_PERMANENT;
+	else if (strcmp(str, "temporary") == 0)
+		return UNDO_TEMP;
+	else if (strcmp(str, "unlogged") == 0)
+		return UNDO_UNLOGGED;
+	else
+		elog(ERROR, "unknown undo persistence level: %s", str);
+}
+
+/*
+ * Prepare and insert data in undo storage and fetch it back to verify.
+ */
+Datum
+test_undo_api(PG_FUNCTION_ARGS)
+{
+	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	char	*data = "test_data";
+	int		 len = strlen(data);
+	UnpackedUndoRecord	undorecord;
+	UnpackedUndoRecord *undorecord_out;
+	int	header_size = offsetof(UnpackedUndoRecord, uur_next) + sizeof(uint64);
+	UndoRecPtr	undo_ptr;
+
+	undorecord.uur_type = 0;
+	undorecord.uur_info = 0;
+	undorecord.uur_prevlen = 0;
+	undorecord.uur_prevxid = FrozenTransactionId;
+	undorecord.uur_xid = xid;
+	undorecord.uur_cid = 0;
+	undorecord.uur_tsid = 100;
+	undorecord.uur_fork = MAIN_FORKNUM;
+	undorecord.uur_blkprev = 0;
+	undorecord.uur_block = 1;
+	undorecord.uur_offset = 100;
+	initStringInfo(&undorecord.uur_tuple);
+	
+	appendBinaryStringInfo(&undorecord.uur_tuple,
+						   (char *) data,
+						   len);
+	undo_ptr = PrepareUndoInsert(&undorecord, persistence, xid, NULL);
+	InsertPreparedUndo();
+	UnlockReleaseUndoBuffers();
+	
+	undorecord_out = UndoFetchRecord(undo_ptr, InvalidBlockNumber,
+									 InvalidOffsetNumber,
+									 InvalidTransactionId, NULL,
+									 NULL);
+
+	if (strncmp((char *) &undorecord, (char *) undorecord_out, header_size) != 0)
+		elog(ERROR, "undo header did not match");
+	if (strncmp(undorecord_out->uur_tuple.data, data, len) != 0)
+		elog(ERROR, "undo data did not match");
+
+	UndoRecordRelease(undorecord_out);
+	pfree(undorecord.uur_tuple.data);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_undo_api/test_undo_api.control b/src/test/modules/test_undo_api/test_undo_api.control
new file mode 100644
index 0000000..09df344
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.control
@@ -0,0 +1,4 @@
+comment = 'test_undo_api'
+default_version = '1.0'
+module_pathname = '$libdir/test_undo_api'
+relocatable = true
#9Amit Kapila
amit.kapila16@gmail.com
In reply to: Thomas Munro (#7)
Re: Undo logs

On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

On Fri, Aug 31, 2018 at 10:24 PM Dilip Kumar
<dilip.kumar@enterprisedb.com> wrote:

On Fri, Aug 31, 2018 at 3:08 PM, Dilip Kumar <dilipbalaut@gmail.com> wrote:

As Thomas has already mentioned upthread that we are working on an
undo-log based storage and he has posted the patch sets for the lowest
layer called undo-log-storage.

This is the next layer which sits on top of the undo log storage,
which will provide an interface for prepare, insert, or fetch the undo
records. This layer will use undo-log-storage to reserve the space for
the undo records and buffer management routine to write and read the
undo records.

I have also pushed a new WIP version of the lower level undo log
storage layer patch set to a public branch[1]. I'll leave the earlier
branch[2] there because the record-level patch posted by Dilip depends
on it for now.

I have started reading the patch and have a few assorted comments
which are mentioned below. I have been involved in the high-level
design of this module and I have also given some suggestions
during development, but this is mainly Thomas's work with some help
from Dilip. It would be good if other members of the community also
review the design or participate in the discussion.

Comments
------------------
undo/README
-----------------------
1.
+The undo log subsystem provides a way to store data that is needed for
+a limited time.  Undo data is generated whenever zheap relations are
+modified, but it is only useful until (1) the generating transaction
+is committed or rolled back and (2) there is no snapshot that might
+need it for MVCC purposes.

I think the snapshots need it for MVCC purpose and we need it till the
transaction is committed and all-visible.

2.
+* the tablespace that holds its segment files
+* the persistence level (permanent, unlogged, temporary)
+* the "discard" pointer; data before this point has been discarded
+* the "insert" pointer: new data will be written here
+* the "end" pointer: a new undo segment file will be needed at this point
+
+The three pointers discard, insert and end move strictly forwards
+until the whole undo log has been exhausted.  At all times discard <=
+insert <= end.  When discard == insert, the undo log is empty
+(everything that has ever been inserted has since been discarded).
+The insert pointer advances when regular backends allocate new space,
+and the discard pointer usually advances when an undo worker process
+determines that no session could need the data either for rollback or
+for finding old versions of tuples to satisfy a snapshot.  In some
+special cases including single-user mode and temporary undo logs the
+discard pointer might also be advanced synchronously by a foreground
+session.

Here, the use of the insert and discard pointers is explained nicely. Can
you elaborate on the usage of the end pointer as well?
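For what it's worth, the quoted invariant can be read as the following small
sketch (the meta field names are assumptions for illustration; they are not
taken from the quoted text):

	bool		empty;

	/* Illustration only: invariant described in the quoted README text. */
	Assert(log->meta.discard <= log->meta.insert);
	Assert(log->meta.insert <= log->meta.end);

	/* Empty when everything ever inserted has since been discarded. */
	empty = (log->meta.discard == log->meta.insert);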

3.
+UndoLogControl objects corresponding to the current set of active undo
+logs are held in a fixed-sized pool in shared memory.  The size of
+the array is a multiple of max_connections, and limits the total size of
+transactions.

Here, it is mentioned that the array is a multiple of max_connections, but
the code uses MaxBackends. Can you sync them?

4.
+The meta-data for all undo logs is written to disk at every
+checkpoint.  It is stored in files under PGDATA/pg_undo/, using the
+checkpoint's redo point (a WAL LSN) as its filename.  At startup time,
+the redo point's file can be used to restore all undo logs' meta-data
+as of the moment of the redo point into shared memory.  Changes to the
+discard pointer and end pointer are WAL-logged by undolog.c and will
+bring the in-memory meta-data up to date in the event of recovery
+after a crash.  Changes to insert pointers are included in other WAL
+records (see below).

I see one inconvenience in using the checkpoint's redo point for the meta
file name: what if someone uses pg_resetxlog to truncate the redo? Is
there any reason we can't use a different name for the meta file?

5.
+stabilize on one undo log per active writing backend (or more if
+different tablespaces are persistence levels are used).

/tablespaces are persistence levels/tablespaces and persistence levels

I think due to the above design, we can now reach the maximum number
of undo logs quickly, as the patch now uses fixed shared memory to
represent them. I am not sure if there is an easy way to avoid that.
Can we try to expose a GUC for the maximum number of undo slots such that
instead of MaxBackends * 4, it could be MaxBackends * <new_guc>?
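Sketching that suggestion (the GUC name and default below are made up for
illustration; only MaxBackends * 4 comes from the patch, as quoted later in
comment 7):

	/* Hypothetical GUC replacing the hard-coded factor of 4. */
	int			max_undo_slots_per_backend = 4;

	static inline size_t
	UndoLogNumSlots(void)
	{
		return MaxBackends * max_undo_slots_per_backend;
	}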

undo-log-manager patch
------------------------------------
6.
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  size = add_size(size, ProcGlobalShmemSize());
  size = add_size(size, XLOGShmemSize());
  size = add_size(size, CLOGShmemSize());
+ size = add_size(size, UndoLogShmemSize());
  size = add_size(size, CommitTsShmemSize());
  size = add_size(size, SUBTRANSShmemSize());
  size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
  */
 XLOGShmemInit();
 CLOGShmemInit();
+ UndoLogShmemInit();

It seems to me that we will always allocate shared memory for undo
logs irrespective of whether someone wants to use them or not. Am I
right? If so, isn't it better if we find some way that this memory is
allocated only when someone has a need for it?

7.
+/*
+ * How many undo logs can be active at a time?  This creates a theoretical
+ * maximum transaction size, but it we set it to a factor the maximum number
+ * of backends it will be a very high limit.  Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+ return MaxBackends * 4;
+}

Seems like typos in the above comment
/but it we/but if we
/factor the maximum number -- the sentence is not completely clear.

8.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)

You said in the email that the patch doesn't use DSM segments anymore,
but the comment seems to indicate otherwise.

9.
/*
+ * TODO: Should this function be usable in a critical section?
+ * Woudl it make sense to detect that we are in a critical

Typo
/Woudl/Would

10.
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+ xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+ UndoLogControl *log;
+
+ undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+ /*
+ * Whatever follows is the first record for this transaction.  Zheap will
+ * use this to add UREC_INFO_TRANSACTION.
+ */
+ log = get_undo_log(xlrec->logno, false);
+ /* TODO */

There are a lot of TODOs in the code; among them, the above one is not at all clear.

11.
+ UndoRecPtr oldest_data;
+
+} UndoLogControl;

Extra space after the last member looks odd.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#10Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#9)
Re: Undo logs

On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I have also pushed a new WIP version of the lower level undo log
storage layer patch set to a public branch[1]. I'll leave the earlier
branch[2] there because the record-level patch posted by Dilip depends
on it for now.

I have started reading the patch and have a few assorted comments
which are mentioned below. I have been involved in the high-level
design of this module and I have also shared given some suggestions
during development, but this is mainly Thomas's work with some help
from Dilip. It would be good if other members of the community also
review the design or participate in the discussion.

Comments
------------------

Some more comments/questions on the design level choices you have made
in this patch and some general comments.

1. To allocate an undo log (UndoLogAllocate()), it seems first we are
creating the shared memory state for an undo log, write a WAL for it,
create an actual file and segment in it and write a separate WAL for
it. Now imagine the system crashed after creating a shared memory
state and before actually allocating an undo log segment, then it is
quite possible that after recovery we will block multiple slots for
undo logs without having actual undo logs for them. Apart from that
writing separate WAL for them doesn't appear to be the best way to
deal with it considering that we also need to write a third WAL to
attach an undo log.

Now, IIUC, one advantage of arranging the things this way is that we
avoid dropping the tablespaces when a particular undo log exists in
it. I understand that this design kind of works, but I think we
should try to think of some alternatives here. You might have already
thought of making it work similar to how the interaction for regular
tables or temp_tablespaces works with dropping the tablespaces but
decided to do something different here. Can you explain why you have
made a different design choice here?

2.
extend_undo_log()
{
..
+ /*
+ * Flush the parent dir so that the directory metadata survives a crash
+ * after this point.
+ */
+ UndoLogDirectory(log->meta.tablespace, dir);
+ fsync_fname(dir, true);
+
+ /*
+ * If we're not in recovery, we need to WAL-log the creation of the new
+ * file(s).  We do that after the above filesystem modifications, in
+ * violation of the data-before-WAL rule as exempted by
+ * src/backend/access/transam/README.  This means that it's possible for
+ * us to crash having made some or all of the filesystem changes but
+ * before WAL logging, but in that case we'll eventually try to create the
+ * same segment(s) again, which is tolerated.
+ */
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
..
}

I don't understand this WAL logging action. If the crash happens
before or during syncing the file, then we anyway don't have WAL to
replay. If it happens after WAL writing, then anyway we are sure that
the extended undo log segment must be there. Can you explain how this
works?

3.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
}

What will happen if the transaction creating undo log segment rolls
back? Do we want to have pendingDeletes stuff as we have for normal
relation files? This might also help in clearing the shared memory
state (undo log slots) if any.

4.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
/*
+ * Take the tablespace create/drop lock while we look the name up.
+ * This prevents the tablespace from being dropped while we're trying
+ * to resolve the name, or while the called is trying to create an
+ * undo log in it.  The caller will have to release this lock.
+ */
+ LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
..

This appears quite expensive: to select an undo log to attach, we
might need to wait for an unrelated tablespace create/drop. Have you
considered any other ideas to prevent this? How other callers of
get_tablespace_oid prevent it from being dropped? If we don't find
any better solution, then I think at the very least we should start a
separate thread to know the opinion of others on this matter. I think
this is somewhat related to point-1.

5.
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ UndoLogTableEntry  *entry;
+
+ /*
+ * Fast path, for undo logs we've seen before.  This is safe because
+ * tablespaces are constant for the lifetime of an undo log number.
+ */
+ entry = undologtable_lookup(undologtable_cache, logno);
+ if (likely(entry))
+ return entry->tablespace;
+
+ /*
+ * Slow path: force cache entry to be created.  Raises an error if the
+ * undo log has been entirely discarded, or hasn't been created yet.  That
+ * is appropriate here, because this interface is designed for accessing
+ * undo pages via bufmgr, and we should never be trying to access undo
+ * pages that have been discarded.
+ */
+ UndoLogGet(logno, false);

It seems UndoLogGet() probes the hash table first, so what is the need
for doing it in the caller? And if you think it is better to perform the
lookup in the caller, then maybe we should avoid doing it inside
UndoLogGet()->get_undo_log()->undologtable_lookup().

6.
+get_undo_log(UndoLogNumber logno, bool locked)
{
..
+ /*
+ * If we didn't find it, then it must already have been entirely
+ * discarded.  We create a negative cache entry so that we can answer
+ * this question quickly next time.
+ *
+ * TODO: We could track the lowest known undo log number, to reduce
+ * the negative cache entry bloat.
+ */
+ if (result == NULL)
+ {
+ /*
+ * Sanity check: the caller should not be asking about undo logs
+ * that have never existed.
+ */
+ if (logno >= shared->next_logno)
+ elog(ERROR, "undo log %u hasn't been created yet", logno);
+ entry = undologtable_insert(undologtable_cache, logno, &found);
+ entry->number = logno;
+ entry->control = NULL;
+ entry->tablespace = 0;
+ }
..
}

Are you planning to take care of this TODO? In any case, do we have
any mechanism to clear this bloat or will it stay till the end of the
session? If it is the latter, then I think it is important to take care
of the TODO.
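As a rough sketch of what that TODO could look like (the low_logno field is
made up purely for illustration):

	/* Hypothetical fast exit before creating negative cache entries: any
	 * log number below the lowest one that can still exist must already
	 * have been discarded. */
	if (logno < shared->low_logno)
		return NULL;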

7.
+void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);

You might want to add an extra line before /* Redo interface. */
following what has been done earlier in this file.

8.
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)

This function doesn't appear to be filling in xl_undolog_meta. Am I
missing something? If not, then this comment needs to be changed.

9.
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+ char   *rawname;
+ List   *namelist;
+ bool need_to_unlock;
+ int length;
+ int i;
+
+ /* We need a modifiable copy of string. */
+ rawname = pstrdup(undo_tablespaces);

I don't see the usage of rawname outside this function, isn't it
better to free it? I understand that this function won't be called
frequently enough to matter, but still there is some theoretical
danger if a user continuously changes undo_tablespaces.

10.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /*
+ * For now we have a simple linked list of unattached undo logs for each
+ * persistence level.  We'll grovel though it to find something for the

Typo.
/though/through

11.
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
{
..
+ /* WAL-log the creation of this new undo log. */
+ {
+ xl_undolog_create xlrec;
+
+ xlrec.logno = logno;
+ xlrec.tablespace = log->meta.tablespace;
+ xlrec.persistence = log->meta.persistence;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+ }
..
}

Do we need to WAL log this for temporary/unlogged persistence level?
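One way to read that question, sketched against the quoted code (an
illustration only, not a proposed change from the patch):

	/* Skip WAL for non-permanent undo logs (sketch; whether this is safe
	 * for unlogged undo is exactly the open question above). */
	if (log->meta.persistence == UNDO_PERMANENT)
	{
		xl_undolog_create xlrec;

		xlrec.logno = logno;
		xlrec.tablespace = log->meta.tablespace;
		xlrec.persistence = log->meta.persistence;

		XLogBeginInsert();
		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
	}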

12.
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
{
..
+ oid = get_tablespace_oid(name, true);
..

Do we need to check permissions to see if the current user is allowed
to create in this tablespace?
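A sketch of such a check, following the usual tablespace permission pattern
('oid' and 'name' are the variables from the quoted function; this is an
illustration, not code from the patch):

	AclResult	aclresult;

	/* Require CREATE rights on the tablespace before placing undo data there. */
	aclresult = pg_tablespace_aclcheck(oid, GetUserId(), ACL_CREATE);
	if (aclresult != ACLCHECK_OK)
		aclcheck_error(aclresult, OBJECT_TABLESPACE, name);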

13.
+UndoLogAllocate(size_t size, UndoPersistence persistence)
{
..
+ log->meta.prevlogno = prevlogno;

Is it okay to update the meta information without a lock, or should we do
it a few lines down after taking the mutex lock? If it is okay, then it
would be better to write a comment explaining that.

14.
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
+ /* Flush the contents of the file to disk. */
+ if (pg_fsync(fd) != 0)
+ elog(ERROR, "cannot fsync file \"%s\": %m", path);
..
}

You might want to have a wait event for this, as we do at other
places where we perform fsync.
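For illustration, wrapping the quoted fsync with the usual wait event
reporting (the event name below is invented):

	/* Flush the contents of the file to disk, reporting a wait event. */
	pgstat_report_wait_start(WAIT_EVENT_UNDO_FILE_SYNC);	/* hypothetical event */
	if (pg_fsync(fd) != 0)
		elog(ERROR, "cannot fsync file \"%s\": %m", path);
	pgstat_report_wait_end();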

15.
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+ UndoLogOffset end)
{
..
+ if (!InRecovery)
+ {
+ xl_undolog_extend xlrec;
+ XLogRecPtr ptr;
+
+ xlrec.logno = logno;
+ xlrec.end = end;
+
+ XLogBeginInsert();
+ XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+ ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+ XLogFlush(ptr);
+ }
..
}

Do we need it for temporary/unlogged persistence level?

16.
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+ xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+ UndoLogControl *log;
+ UndoLogSharedData *shared = MyUndoLogState.shared;
+
+ /* Create meta-data space in shared memory. */
+ LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+ /* TODO: assert that it doesn't exist already? */
+ log = allocate_undo_log();
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);

Do we need to acquire UndoLogLock during replay? What else can be
going on concurrently with this that could create a problem?

17.
UndoLogAllocate()
{
..
+ /*
+ * While we have the lock, check if we have been forcibly detached by
+ * DROP TABLESPACE.  That can only happen between transactions (see
+ * DropUndoLogsInsTablespace()).
+ */
..
}

Function name in above comment is wrong.

18.
+ {
+ {"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+ gettext_noop("Sets the tablespace(s) to use for undo logs."),
+ NULL,
+ GUC_LIST_INPUT | GUC_LIST_QUOTE
+ },
+ &undo_tablespaces,
+ "",
+ check_undo_tablespaces, assign_undo_tablespaces, NULL
+ },

It seems you need to update variable_is_guc_list_quote for this variable.

Till now, I have mainly reviewed the undo log allocation part. This is a
big patch and can take much more time to review completely. I will
review the other parts of the patch later. I have changed the status
of this CF entry to "Waiting on Author"; feel free to change it once
you think all the comments are addressed.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#11Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#8)
4 attachment(s)
Re: Undo logs

On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thomas has already posted the latest version of undo log patches on
'Cleaning up orphaned files using undo logs' thread[1]/messages/by-id/CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com. So I have
rebased the undo-interface patch also. This patch also includes
latest defect fixes from the main zheap branch [2]https://github.com/EnterpriseDB/zheap/.

I have also made some changes to the undo-log patches. Basically, it
is just some cleanup work, plus making these patches independently
compilable. I have moved some code into the undo-log patches and
moved out some code that is not relevant to undo-log.

Some examples:
1. Moved the undo log startup and checkpoint related code into the
'0001-Add-undo-log-manager_v2.patch' patch:
+ /* Recover undo log meta data corresponding to this checkpoint. */
+ StartupUndoLogs(ControlFile->checkPointCopy.redo);
+
2. Removed the undo-worker related stuff from this patch:
+ case WAIT_EVENT_UNDO_DISCARD_WORKER_MAIN:
+ event_name = "UndoDiscardWorkerMain";
+ break;
+ case WAIT_EVENT_UNDO_LAUNCHER_MAIN:
+ event_name = "UndoLauncherMain";
+ break;

[1]: /messages/by-id/CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com
[2]: https://github.com/EnterpriseDB/zheap/

Patch applying order:
0001-Add-undo-log-manager.patch
0002-Provide-access-to-undo-log-data-via-the-buffer-manag.patch
0003-undo-interface-v3.patch
0004-Add-tests-for-the-undo-log-manager.patch from the 'Cleaning up
orphaned files using undo logs' thread[1]/messages/by-id/CAEepm=0ULqYgM2aFeOnrx6YrtBg3xUdxALoyCG+XpssKqmezug@mail.gmail.com
0004-undo-interface-test-v3.patch

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v2.patchapplication/x-patch; name=0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v2.patchDownload
From c9fc74f03a3c81f441095abc3ad46cc6e00b27d9 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 5 Nov 2018 00:20:39 -0800
Subject: [PATCH 2/4] Provide access to undo log data via the buffer manager.

In ancient Berkeley POSTGRES, smgr.c allowed for different storage engines, of
which only md.c survives.  Revive this mechanism to provide access to undo log
data through the existing buffer manager.

Undo logs exist in a pseudo-database whose OID is used to dispatch IO requests
to undofile.c instead of md.c.

Note: a separate proposal generalizes the fsync request machinery, see
https://commitfest.postgresql.org/20/1829/.  This patch has some stand-in
fsync machinery, but will be rebased on that other one depending on progress.
It seems better to avoid tangling up too many concurrent proposals, so for
now this patch has its own fsync queue, duplicating some code from md.c.

Author: Thomas Munro, though ForgetBuffer() was contributed by Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  10 +-
 src/backend/postmaster/checkpointer.c  |   2 +-
 src/backend/postmaster/pgstat.c        |  24 +-
 src/backend/storage/buffer/bufmgr.c    |  82 ++++-
 src/backend/storage/smgr/Makefile      |   2 +-
 src/backend/storage/smgr/md.c          |  15 +-
 src/backend/storage/smgr/smgr.c        |  49 ++-
 src/backend/storage/smgr/undofile.c    | 556 +++++++++++++++++++++++++++++++++
 src/include/pgstat.h                   |  16 +-
 src/include/storage/bufmgr.h           |  14 +-
 src/include/storage/smgr.h             |  35 ++-
 src/include/storage/undofile.h         |  50 +++
 12 files changed, 820 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/storage/smgr/undofile.c
 create mode 100644 src/include/storage/undofile.h

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc92..8fed7b1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -346,7 +346,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * Make sure that if the block is marked with WILL_INIT, the caller is
 	 * going to initialize it. And vice versa.
 	 */
-	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+	zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
@@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index 1a03309..d5bdf53 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1326,7 +1326,7 @@ AbsorbFsyncRequests(void)
 	LWLockRelease(CheckpointerCommLock);
 
 	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+		smgrrequestsync(request->rnode, request->forknum, request->segno);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 42bccce..dc86307 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3511,7 +3511,7 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
-			/* no default case, so that compiler will warn */
+		/* no default case, so that compiler will warn */
 	}
 
 	return event_name;
@@ -3893,6 +3893,28 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_READ:
+			event_name = "UndoCheckpointRead";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_WRITE:
+			event_name = "UndoCheckpointWrite";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_SYNC:
+			event_name = "UndoCheckpointSync";
+			break;
+		case WAIT_EVENT_UNDO_FILE_READ:
+			event_name = "UndoFileRead";
+			break;
+		case WAIT_EVENT_UNDO_FILE_WRITE:
+			event_name = "UndoFileWrite";
+			break;
+		case WAIT_EVENT_UNDO_FILE_FLUSH:
+			event_name = "UndoFileFlush";
+			break;
+		case WAIT_EVENT_UNDO_FILE_SYNC:
+			event_name = "UndoFileSync";
+			break;
+
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 01eabe5..0d208e6 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
 static inline int32 GetPrivateRefCount(Buffer buffer);
 static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void InvalidateBuffer(BufferDesc *buf);
 
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * valid, the page is zeroed instead of throwing an error. This is intended
  * for non-critical data, where the caller is prepared to repair errors.
  *
- * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
+ * In RBM_ZERO mode, if the page isn't in buffer cache already, it's
  * filled with zeros instead of reading it from disk.  Useful when the caller
  * is going to fill the page from scratch, since this saves I/O and avoids
  * unnecessary failure if the page-on-disk has corrupt page headers.
+ *
+ * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked.
  * The page is returned locked to ensure that the caller has a chance to
  * initialize the page before it's made visible to others.
  * Caution: do not use this mode to read a page that is beyond the relation's
@@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy,
+						  char relpersistence)
 {
 	bool		hit;
 
-	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
-
-	Assert(InRecovery);
+	SMgrRelation smgr = smgropen(rnode,
+								 relpersistence == RELPERSISTENCE_TEMP
+								 ? MyBackendId : InvalidBackendId);
 
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -877,7 +876,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Read in the page, unless the caller intends to overwrite it and
 		 * just wants us to allocate a buffer.
 		 */
-		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+		if (mode == RBM_ZERO ||
+			mode == RBM_ZERO_AND_LOCK ||
+			mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
@@ -1332,6 +1333,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 }
 
 /*
+ * ForgetBuffer -- drop a buffer from shared buffers
+ *
+ * If the buffer isn't present in shared buffers, nothing happens.  If it is
+ * present, it is discarded without making any attempt to write it back out to
+ * the operating system.  The caller must therefore somehow be sure that the
+ * data won't be needed for anything now or in the future.  It assumes that
+ * there is no concurrent access to the block, except that it might be being
+ * concurrently written.
+ */
+void
+ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+{
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+	BufferTag	tag;			/* identity of target block */
+	uint32		hash;			/* hash value for tag */
+	LWLock	   *partitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	hash = BufTableHashCode(&tag);
+	partitionLock = BufMappingPartitionLock(hash);
+
+	/* see if the block is in the buffer pool */
+	LWLockAcquire(partitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&tag, hash);
+	LWLockRelease(partitionLock);
+
+	/* didn't find it, so nothing to do */
+	if (buf_id < 0)
+		return;
+
+	/* take the buffer header lock */
+	bufHdr = GetBufferDescriptor(buf_id);
+	buf_state = LockBufHdr(bufHdr);
+
+	/*
+	 * The buffer might have been evicted after we released the partition lock and
+	 * before we acquired the buffer header lock.  If so, the buffer we've
+	 * locked might contain some other data which we shouldn't touch. If the
+	 * buffer hasn't been recycled, we proceed to invalidate it.
+	 */
+	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		bufHdr->tag.blockNum == blockNum &&
+		bufHdr->tag.forkNum == forkNum)
+		InvalidateBuffer(bufHdr);		/* releases spinlock */
+	else
+		UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
  * InvalidateBuffer -- mark a shared buffer invalid and return it to the
  * freelist.
  *
@@ -1406,7 +1462,7 @@ retry:
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
 		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
-			elog(ERROR, "buffer is pinned in InvalidateBuffer");
+			elog(PANIC, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
 	}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0..b657eb2 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrtype.o undofile.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index f4374d0..0395398 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -45,7 +45,7 @@
 #define UNLINKS_PER_ABSORB		10
 
 /*
- * Special values for the segno arg to RememberFsyncRequest.
+ * Special values for the segno arg to mdrequestsync.
  *
  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
@@ -1434,7 +1434,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
 	}
 	else
 	{
@@ -1470,8 +1470,7 @@ register_unlink(RelFileNodeBackend rnode)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+		mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST);
 	}
 	else
 	{
@@ -1490,7 +1489,7 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ * mdrequestsync() -- callback from checkpointer side of fsync request
  *
  * We stuff fsync requests into the local hash table for execution
  * during the checkpointer's next checkpoint.  UNLINK requests go into a
@@ -1511,7 +1510,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
 {
 	Assert(pendingOpsTable);
 
@@ -1654,7 +1653,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
@@ -1693,7 +1692,7 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342e..d0b2c0d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,6 +58,8 @@ typedef struct f_smgr
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
+	void		(*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum,
+									 int segno);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);	/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
@@ -81,15 +83,45 @@ static const f_smgr smgrsw[] = {
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
+		.smgr_requestsync = mdrequestsync,
 		.smgr_immedsync = mdimmedsync,
 		.smgr_pre_ckpt = mdpreckpt,
 		.smgr_sync = mdsync,
 		.smgr_post_ckpt = mdpostckpt
+	},
+	/* undo logs */
+	{
+		.smgr_init = undofile_init,
+		.smgr_shutdown = undofile_shutdown,
+		.smgr_close = undofile_close,
+		.smgr_create = undofile_create,
+		.smgr_exists = undofile_exists,
+		.smgr_unlink = undofile_unlink,
+		.smgr_extend = undofile_extend,
+		.smgr_prefetch = undofile_prefetch,
+		.smgr_read = undofile_read,
+		.smgr_write = undofile_write,
+		.smgr_writeback = undofile_writeback,
+		.smgr_nblocks = undofile_nblocks,
+		.smgr_truncate = undofile_truncate,
+		.smgr_requestsync = undofile_requestsync,
+		.smgr_immedsync = undofile_immedsync,
+		.smgr_pre_ckpt = undofile_preckpt,
+		.smgr_sync = undofile_sync,
+		.smgr_post_ckpt = undofile_postckpt
 	}
 };
 
 static const int NSmgr = lengthof(smgrsw);
 
+/*
+ * In ancient Postgres the catalog entry for each relation controlled the
+ * choice of storage manager implementation.  Now we have only md.c for
+ * regular relations, and undofile.c for undo log storage in the undolog
+ * pseudo-database.
+ */
+#define SmgrWhichForRelFileNode(rfn)			\
+	((rfn).dbNode == 9 ? 1 : 0)
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -185,11 +217,18 @@ smgropen(RelFileNode rnode, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		reln->smgr_fsm_nblocks = InvalidBlockNumber;
 		reln->smgr_vm_nblocks = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		/* Which storage manager implementation? */
+		reln->smgr_which = SmgrWhichForRelFileNode(rnode);
 
 		/* mark it not open */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
 			reln->md_num_open_segs[forknum] = 0;
+			reln->md_seg_fds[forknum] = NULL;
+		}
+
+		reln->private_data = NULL;
 
 		/* it has no owner yet */
 		add_to_unowned_list(reln);
@@ -723,6 +762,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 }
 
 /*
+ *	smgrrequestsync() -- Enqueue a request for smgrsync() to flush data.
+ */
+void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno);
+}
+
+/*
  *	smgrimmedsync() -- Force the specified relation to stable storage.
  *
  *		Synchronously force all previous writes to the specified relation
diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
new file mode 100644
index 0000000..3d06dd7
--- /dev/null
+++ b/src/backend/storage/smgr/undofile.c
@@ -0,0 +1,556 @@
+/*
+ * undofile.c
+ *
+ * PostgreSQL undo file manager.  This module provides SMGR-compatible
+ * interface to the files that back undo logs on the filesystem, so that undo
+ * log data can use the shared buffer pool.  Other aspects of undo log
+ * management are provided by undolog.c, so the SMGR interfaces not directly
+ * concerned with reading, writing and flushing data are unimplemented.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/storage/smgr/undofile.c
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/fd.h"
+#include "storage/undofile.h"
+#include "utils/memutils.h"
+
+/* intervals for calling AbsorbFsyncRequests in undofile_sync */
+#define FSYNCS_PER_ABSORB		10
+
+/*
+ * Special values for the fork arg to undofile_requestsync.
+ */
+#define FORGET_UNDO_SEGMENT_FSYNC	(InvalidBlockNumber)
+
+/*
+ * While md.c expects random access and has a small number of huge
+ * segments, undofile.c manages a potentially very large number of smaller
+ * segments and has a less random access pattern.  Therefore, instead of
+ * keeping a potentially huge array of vfds we'll just keep the most
+ * recently accessed N.
+ *
+ * For now, N == 1, so we just need to hold onto one 'File' handle.
+ */
+typedef struct UndoFileState
+{
+	int		mru_segno;
+	File	mru_file;
+} UndoFileState;
+
+static MemoryContext UndoFileCxt;
+
+typedef uint16 CycleCtr;
+
+/*
+ * An entry recording the segments that need to be fsynced by undofile_sync().
+ * This is a bit simpler than md.c's version, though it could perhaps be
+ * merged into a common struct.  One difference is that we can have much
+ * larger segment numbers, so we'll adjust for that to avoid having a lot of
+ * leading zero bits.
+ */
+typedef struct
+{
+	RelFileNode rnode;
+	Bitmapset  *requests;
+	CycleCtr	cycle_ctr;
+} PendingOperationEntry;
+
+static HTAB *pendingOpsTable = NULL;
+static MemoryContext pendingOpsCxt;
+
+static CycleCtr undofile_sync_cycle_ctr = 0;
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok);
+static File undofile_get_segment_file(SMgrRelation reln, int segno);
+
+void
+undofile_init(void)
+{
+	UndoFileCxt = AllocSetContextCreate(TopMemoryContext,
+										"UndoFileSmgr",
+										ALLOCSET_DEFAULT_SIZES);
+
+	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	{
+		HASHCTL		hash_ctl;
+
+		pendingOpsCxt = AllocSetContextCreate(UndoFileCxt,
+											  "Pending ops context",
+											  ALLOCSET_DEFAULT_SIZES);
+		MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+		hash_ctl.keysize = sizeof(RelFileNode);
+		hash_ctl.entrysize = sizeof(PendingOperationEntry);
+		hash_ctl.hcxt = pendingOpsCxt;
+		pendingOpsTable = hash_create("Pending Ops Table",
+									  100L,
+									  &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+}
+
+void
+undofile_shutdown(void)
+{
+}
+
+void
+undofile_close(SMgrRelation reln, ForkNumber forknum)
+{
+}
+
+void
+undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_create is not supported");
+}
+
+bool
+undofile_exists(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_exists is not supported");
+}
+
+void
+undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_unlink is not supported");
+}
+
+void
+undofile_extend(SMgrRelation reln, ForkNumber forknum,
+				BlockNumber blocknum, char *buffer,
+				bool skipFsync)
+{
+	elog(ERROR, "undofile_extend is not supported");
+}
+
+void
+undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	elog(ERROR, "undofile_prefetch is not supported");
+}
+
+void
+undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			  char *buffer)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	if (FileSeek(file, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(file))));
+	nbytes = FileRead(file, buffer, BLCKSZ, WAIT_EVENT_UNDO_FILE_READ);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+}
+
+static void
+register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file)
+{
+	/* Temp relations should never be fsync'd */
+	Assert(!SmgrIsTemp(reln));
+
+	if (pendingOpsTable)
+	{
+		/* push it into local pending-ops table */
+		undofile_requestsync(reln->smgr_rnode.node, forknum, segno);
+	}
+	else
+	{
+		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno))
+			return;				/* passed it off successfully */
+
+		ereport(DEBUG1,
+				(errmsg("could not forward fsync request because request queue is full")));
+
+		if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(file))));
+	}
+}
+
+void
+undofile_write(SMgrRelation reln, ForkNumber forknum,
+			   BlockNumber blocknum, char *buffer,
+			   bool skipFsync)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	if (FileSeek(file, seekpos, SEEK_SET) != seekpos)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not seek to block %u in file \"%s\": %m",
+						blocknum, FilePathName(file))));
+	nbytes = FileWrite(file, buffer, BLCKSZ, WAIT_EVENT_UNDO_FILE_WRITE);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		/*
+		 * short write: unexpected, because this should be overwriting an
+		 * entirely pre-allocated segment file
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+
+	if (!skipFsync && !SmgrIsTemp(reln))
+		register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file);
+}
+
+void
+undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+				   BlockNumber blocknum, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		File	file;
+		int		nflush;
+
+		file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+
+		/* compute number of desired writes within the current segment */
+		nflush = Min(nblocks,
+					 1 + UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE));
+
+		FileWriteback(file,
+					  (blocknum % UNDOSEG_SIZE) * BLCKSZ,
+					  nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH);
+
+		nblocks -= nflush;
+		blocknum += nflush;
+	}
+}
+
+BlockNumber
+undofile_nblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_nblocks is not supported");
+	return 0;
+}
+
+void
+undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
+{
+	elog(ERROR, "undofile_truncate is not supported");
+}
+
+void
+undofile_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_immedsync is not supported");
+}
+
+void
+undofile_preckpt(void)
+{
+}
+
+void
+undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+	PendingOperationEntry *entry;
+	bool		found;
+
+	Assert(pendingOpsTable);
+
+	if (forknum == FORGET_UNDO_SEGMENT_FSYNC)
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_FIND,
+													  NULL);
+		if (entry)
+			entry->requests = bms_del_member(entry->requests, segno);
+	}
+	else
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_ENTER,
+													  &found);
+		if (!found)
+		{
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+			entry->requests = bms_make_singleton(segno);
+		}
+		else
+			entry->requests = bms_add_member(entry->requests, segno);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+void
+undofile_forgetsync(Oid logno, Oid tablespace, int segno)
+{
+	RelFileNode rnode;
+
+	rnode.dbNode = 9;
+	rnode.spcNode = tablespace;
+	rnode.relNode = logno;
+
+	if (pendingOpsTable)
+		undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno);
+	else if (IsUnderPostmaster)
+	{
+		while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno))
+			pg_usleep(10000L);
+	}
+}
+
+void
+undofile_sync(void)
+{
+	static bool undofile_sync_in_progress = false;
+
+	HASH_SEQ_STATUS hstat;
+	PendingOperationEntry *entry;
+	int			absorb_counter;
+	int			segno;
+
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	AbsorbFsyncRequests();
+
+	if (undofile_sync_in_progress)
+	{
+		/* prior try failed, so update any stale cycle_ctr values */
+		hash_seq_init(&hstat, pendingOpsTable);
+		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+	}
+
+	undofile_sync_cycle_ctr++;
+	undofile_sync_in_progress = true;
+
+	absorb_counter = FSYNCS_PER_ABSORB;
+	hash_seq_init(&hstat, pendingOpsTable);
+	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+	{
+		Bitmapset	   *requests;
+
+		/* Skip entries that arrived after we arrived. */
+		if (entry->cycle_ctr == undofile_sync_cycle_ctr)
+			continue;
+
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr);
+
+		if (!enableFsync)
+			continue;
+
+		requests = entry->requests;
+		entry->requests = NULL;
+
+		segno = -1;
+		while ((segno = bms_next_member(requests, segno)) >= 0)
+		{
+			File		file;
+
+			if (!enableFsync)
+				continue;
+
+			file = undofile_open_segment_file(entry->rnode.relNode,
+											  entry->rnode.spcNode,
+											  segno, true /* missing_ok */);
+
+			/*
+			 * The file may be gone due to concurrent discard.  We'll ignore
+			 * that, but only if we find a cancel request for this segment in
+			 * the queue.
+			 *
+			 * It's also possible that we succeed in opening a segment file
+			 * that is subsequently recycled (renamed to represent a new range
+			 * of undo log), in which case we'll fsync that later file
+			 * instead.  That is rare and harmless.
+			 */
+			if (file <= 0)
+			{
+				char		name[MAXPGPATH];
+
+				/*
+				 * Put the request back into the bitset in a way that can't
+				 * fail due to memory allocation.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				/*
+				 * Check if a forgetsync request has arrived to delete that
+				 * segment.
+				 */
+				AbsorbFsyncRequests();
+				if (bms_is_member(segno, entry->requests))
+				{
+					UndoLogSegmentPath(entry->rnode.relNode,
+									   segno,
+									   entry->rnode.spcNode,
+									   name);
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\": %m", name)));
+				}
+				/* It must have been removed, so we can safely skip it. */
+				continue;
+			}
+
+			elog(LOG, "fsync()ing %s", FilePathName(file));	/* TODO: remove me */
+			if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0)
+			{
+				char		name[MAXPGPATH];
+
+				strcpy(name, FilePathName(file));
+				FileClose(file);
+
+				/*
+				 * Keep the failed requests, but merge with any new ones.  The
+				 * requirement to be able to do this without risk of failure
+				 * prevents us from using a smaller bitmap that doesn't bother
+				 * tracking leading zeros.  Perhaps another data structure
+				 * would be better.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not fsync file \"%s\": %m", name)));
+			}
+			requests = bms_del_member(requests, segno);
+			FileClose(file);
+
+			if (--absorb_counter <= 0)
+			{
+				AbsorbFsyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB;
+			}
+		}
+
+		bms_free(requests);
+	}
+
+	undofile_sync_in_progress = false;
+}
+
+void undofile_postckpt(void)
+{
+}
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok)
+{
+	File		file;
+	char		path[MAXPGPATH];
+
+	UndoLogSegmentPath(relNode, segno, spcNode, path);
+	file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+	if (file <= 0 && (!missing_ok || errno != ENOENT))
+		elog(ERROR, "cannot open undo segment file '%s': %m", path);
+
+	return file;
+}
+
+/*
+ * Get a File for a particular segment of a SMgrRelation representing an undo
+ * log.
+ */
+static File undofile_get_segment_file(SMgrRelation reln, int segno)
+{
+	UndoFileState *state;
+
+
+	/*
+	 * Create private state space on demand.
+	 *
+	 * XXX There should probably be a smgr 'open' or 'init' interface that
+	 * would do this.  smgr.c currently initializes reln->md_XXX stuff
+	 * directly...
+	 */
+	state = (UndoFileState *) reln->private_data;
+	if (unlikely(state == NULL))
+	{
+		state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState));
+		reln->private_data = state;
+	}
+
+	/* If we have a file open already, check if we need to close it. */
+	if (state->mru_file > 0 && state->mru_segno != segno)
+	{
+		/* These are not the blocks we're looking for. */
+		FileClose(state->mru_file);
+		state->mru_file = 0;
+	}
+
+	/* Check if we need to open a new file. */
+	if (state->mru_file <= 0)
+	{
+		state->mru_file =
+			undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.node.spcNode,
+									   segno, InRecovery);
+		if (InRecovery && state->mru_file <= 0)
+		{
+			/*
+			 * If in recovery, we may be trying to access a file that will
+			 * later be unlinked.  Tolerate missing files, creating a new
+			 * zero-filled file as required.
+			 */
+			UndoLogNewSegment(reln->smgr_rnode.node.relNode,
+							  reln->smgr_rnode.node.spcNode,
+							  segno);
+			state->mru_file =
+				undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+										   reln->smgr_rnode.node.spcNode,
+										   segno, false);
+			Assert(state->mru_file > 0);
+		}
+		state->mru_segno = segno;
+	}
+
+	return state->mru_file;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d1..763379e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -624,6 +624,11 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter tuples_inserted;
 	PgStat_Counter tuples_updated;
 	PgStat_Counter tuples_deleted;
+
+	/*
+	 * Counter tuples_hot_updated stores number of hot updates for heap table
+	 * and the number of inplace updates for zheap table.
+	 */
 	PgStat_Counter tuples_hot_updated;
 
 	PgStat_Counter n_live_tuples;
@@ -743,6 +748,7 @@ typedef enum BackendState
 #define PG_WAIT_IPC					0x08000000U
 #define PG_WAIT_TIMEOUT				0x09000000U
 #define PG_WAIT_IO					0x0A000000U
+#define PG_WAIT_PAGE_TRANS_SLOT		0x0B000000U
 
 /* ----------
  * Wait Events - Activity
@@ -767,7 +773,7 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
 } WaitEventActivity;
 
 /* ----------
@@ -913,6 +919,13 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_READ,
+	WAIT_EVENT_UNDO_CHECKPOINT_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_SYNC,
+	WAIT_EVENT_UNDO_FILE_READ,
+	WAIT_EVENT_UNDO_FILE_WRITE,
+	WAIT_EVENT_UNDO_FILE_FLUSH,
+	WAIT_EVENT_UNDO_FILE_SYNC,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
@@ -1317,6 +1330,7 @@ pgstat_report_wait_end(void)
 
 extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_zheap_update(Relation rel);
 extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce390..5b13556 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -38,8 +38,9 @@ typedef enum BufferAccessStrategyType
 typedef enum
 {
 	RBM_NORMAL,					/* Normal read */
-	RBM_ZERO_AND_LOCK,			/* Don't read from disk, caller will
-								 * initialize. Also locks the page. */
+	RBM_ZERO,					/* Don't read from disk, caller will
+								 * initialize. */
+	RBM_ZERO_AND_LOCK,			/* Like RBM_ZERO, but also locks the page. */
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
@@ -171,7 +172,10 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 				   BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
-						  ReadBufferMode mode, BufferAccessStrategy strategy);
+						  ReadBufferMode mode, BufferAccessStrategy strategy,
+						  char relpersistence);
+extern void ForgetBuffer(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -228,6 +232,10 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+/* in localbuf.c */
+extern void ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum,
+				  BlockNumber blockNum);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc..65d164b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -71,6 +71,9 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
+	/* For use by implementations. */
+	void	   *private_data;
+
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	struct SMgrRelationData *next_unowned_reln;
 } SMgrRelationData;
@@ -105,6 +108,7 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
+extern void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
 extern void smgrsync(void);
@@ -133,14 +137,41 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
+extern void mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
+/* in undofile.c */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+		BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+			BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber nblocks);
+extern void undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
 extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/undofile.h b/src/include/storage/undofile.h
new file mode 100644
index 0000000..7544be3
--- /dev/null
+++ b/src/include/storage/undofile.h
@@ -0,0 +1,50 @@
+/*
+ * undofile.h
+ *
+ * PostgreSQL undo file manager.  This module manages the files that back undo
+ * logs on the filesystem.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/undofile.h
+ */
+
+#ifndef UNDOFILE_H
+#define UNDOFILE_H
+
+#include "storage/smgr.h"
+
+/* Prototypes of functions exposed to SMgr. */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber blocknum, char *buffer,
+							bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, char *buffer,
+						   bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber nblocks);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_pre_ckpt(void);
+extern void undofile_sync(void);
+extern void undofile_post_ckpt(void);
+
+/* Functions used by undolog.c. */
+extern void undofile_forgetsync(Oid logno, Oid tablespace, int segno);
+
+#endif
-- 
1.8.3.1

0001-Add-undo-log-manager_v2.patchapplication/x-patch; name=0001-Add-undo-log-manager_v2.patchDownload
From db1edf775f550567e0008b3e89d1ad3eee0c276e Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 5 Nov 2018 00:16:33 -0800
Subject: [PATCH 1/4] Add undo log manager.

Add a new subsystem to manage undo logs.  Undo logs allow data to be appended
efficiently, like logs.  They also allow data to be discarded efficiently from
the other end, like a queue.  Thirdly, they allow efficient buffered random
access, like a relation.

Undo logs physically consist of a set of 1MB segment files under
$PGDATA/base/undo (or per-tablespace equivalent) that are created, deleted or
renamed as required, similarly to the way that WAL segments are managed.
Meta-data about the set of undo logs is stored in shared memory, and written
to per-checkpoint files under $PGDATA/pg_undo.

This commit provides an API for allocating and discarding undo log storage
space and managing the files in a crash-safe way.  A later commit will provide
support for accessing the data stored inside them.

XXX Status: WIP.  Some details around WAL are being reconsidered, as noted in
comments.

Author: Thomas Munro, with contributions from Dilip Kumar and input from
        Amit Kapila and Robert Haas
Tested-By: Neha Sharma
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/Makefile               |    2 +-
 src/backend/access/rmgrdesc/Makefile      |    2 +-
 src/backend/access/rmgrdesc/undologdesc.c |   88 +
 src/backend/access/transam/rmgr.c         |    1 +
 src/backend/access/transam/xlog.c         |   17 +
 src/backend/access/undo/Makefile          |   17 +
 src/backend/access/undo/undolog.c         | 2643 +++++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql      |    4 +
 src/backend/commands/tablespace.c         |   23 +
 src/backend/replication/logical/decode.c  |    1 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/lmgr/lwlock.c         |    2 +
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/utils/init/postinit.c         |    1 +
 src/backend/utils/misc/guc.c              |   12 +
 src/bin/initdb/initdb.c                   |    2 +
 src/bin/pg_waldump/rmgrdesc.c             |    1 +
 src/include/access/rmgrlist.h             |    1 +
 src/include/access/undolog.h              |  405 +++++
 src/include/access/undolog_xlog.h         |   72 +
 src/include/catalog/pg_proc.dat           |    7 +
 src/include/storage/lwlock.h              |    2 +
 src/include/utils/guc.h                   |    2 +
 src/test/regress/expected/rules.out       |   11 +
 24 files changed, 3318 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
 create mode 100644 src/backend/access/undo/Makefile
 create mode 100644 src/backend/access/undo/undolog.c
 create mode 100644 src/include/access/undolog.h
 create mode 100644 src/include/access/undolog_xlog.h

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index bd93a6a..7f7380c 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  tablesample transam
+			  tablesample transam undo
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..91ad1ef 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,6 @@ include $(top_builddir)/src/Makefile.global
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
 	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o undologdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 0000000..6cf32f4
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ *	  rmgr descriptor routines for access/undo/undolog.c
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_UNDOLOG_CREATE)
+	{
+		xl_undolog_create *xlrec = (xl_undolog_create *) rec;
+
+		appendStringInfo(buf, "logno %u", xlrec->logno);
+	}
+	else if (info == XLOG_UNDOLOG_EXTEND)
+	{
+		xl_undolog_extend *xlrec = (xl_undolog_extend *) rec;
+
+		appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_ATTACH)
+	{
+		xl_undolog_attach *xlrec = (xl_undolog_attach *) rec;
+
+		appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid);
+	}
+	else if (info == XLOG_UNDOLOG_DISCARD)
+	{
+		xl_undolog_discard *xlrec = (xl_undolog_discard *) rec;
+
+		appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end "
+						 UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->discard, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_REWIND)
+	{
+		xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec;
+
+		appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d",
+						 xlrec->logno, xlrec->insert, xlrec->prevlen);
+	}
+
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			id = "CREATE";
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			id = "EXTEND";
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			id = "ATTACH";
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			id = "DISCARD";
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			id = "REWIND";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b05374 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -18,6 +18,7 @@
 #include "access/multixact.h"
 #include "access/nbtxlog.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 246869b..dce4c01 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -6881,6 +6882,9 @@ StartupXLOG(void)
 	 */
 	restoreTwoPhaseData();
 
+	/* Recover undo log meta data corresponding to this checkpoint. */
+	StartupUndoLogs(ControlFile->checkPointCopy.redo);
+
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
@@ -7503,7 +7507,13 @@ StartupXLOG(void)
 	 * end-of-recovery steps fail.
 	 */
 	if (InRecovery)
+	{
 		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+		ResetUndoLogs(UNDO_UNLOGGED);
+	}
+
+	/* Always reset temporary undo logs. */
+	ResetUndoLogs(UNDO_TEMP);
 
 	/*
 	 * We don't need the latch anymore. It's not strictly necessary to disown
@@ -9208,6 +9218,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
+	CheckPointUndoLogs(checkPointRedo, ControlFile->checkPointCopy.redo);
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
@@ -9914,6 +9925,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/*
 		 * We should've already switched to the new TLI before replaying this
 		 * record.
@@ -9973,6 +9987,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/* TLI should not change in an on-line checkpoint */
 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
 			ereport(PANIC,
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
new file mode 100644
index 0000000..219c696
--- /dev/null
+++ b/src/backend/access/undo/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/undo
+#
+# IDENTIFICATION
+#    src/backend/access/undo/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/undo
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = undolog.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
new file mode 100644
index 0000000..48dd662
--- /dev/null
+++ b/src/backend/access/undo/undolog.c
@@ -0,0 +1,2643 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ *	  management of undo logs
+ *
+ * PostgreSQL undo log manager.  This module is responsible for managing the
+ * lifecycle of undo logs and their segment files, associating undo logs with
+ * backends, and allocating space within undo logs.
+ *
+ * For the code that reads and writes blocks of data, see undofile.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogreader.h"
+#include "catalog/catalog.h"
+#include "catalog/pg_tablespace.h"
+#include "commands/tablespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "pgstat.h"
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/standby.h"
+#include "storage/undofile.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/varlena.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+/*
+ * During recovery we maintain a mapping of transaction ID to undo logs
+ * numbers.  We do this with a two-level array, so that we use memory only for
+ * chunks of the array that overlap with the range of active xids.
+ */
+#define UndoLogXidLowBits 16
+
+/*
+ * Number of high bits.
+ */
+#define UndoLogXidHighBits \
+	(sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits)
+
+/* Extract the upper bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits)
+
+/* Extract the lower bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1))
+
+/*
+ * Main control structure for undo log management in shared memory.
+ * UndoLogControl objects are arranged in a fixed-size array, at a position
+ * determined by the undo log number.
+ */
+typedef struct UndoLogSharedData
+{
+	UndoLogNumber free_lists[UndoPersistenceLevels];
+	UndoLogNumber low_logno; /* the lowest logno */
+	UndoLogNumber next_logno; /* one past the highest logno */
+	UndoLogNumber array_size; /* how many UndoLogControl objects do we have? */
+	UndoLogControl logs[FLEXIBLE_ARRAY_MEMBER];
+} UndoLogSharedData;
+
+/*
+ * Per-backend state for the undo log module.
+ * Backend-local pointers to undo subsystem state in shared memory.
+ */
+typedef struct UndoLogSession
+{
+	UndoLogSharedData *shared;
+
+	/*
+	 * The control object for the undo logs that this session is currently
+	 * attached to at each persistence level.  This is where it will write new
+	 * undo data.
+	 */
+	UndoLogControl *logs[UndoPersistenceLevels];
+
+	/*
+	 * This flag is set when the undo_tablespaces GUC changes, to remind us to
+	 * examine it and attach to an undo log in a newly chosen tablespace.
+	 */
+	bool			need_to_choose_tablespace;
+
+	/*
+	 * During recovery, the startup process maintains a mapping of xid to undo
+	 * log number, instead of using 'logs' above.  This is not used in regular
+	 * backends and can be in backend-private memory so long as recovery is
+	 * single-process.  This map references UNDO_PERMANENT logs only, since
+	 * temporary and unlogged relations don't have WAL to replay.
+	 */
+	UndoLogNumber **xid_map;
+
+	/*
+	 * The chunk containing the oldest xid still running.  We advance this
+	 * during checkpoints to free up chunks of the map.
+	 */
+	uint16			xid_map_oldest_chunk;
+
+	/* Current dbid.  Used during recovery. */
+	Oid				dbid;
+} UndoLogSession;
+
+UndoLogSession MyUndoLogState;
+
+undologtable_hash *undologtable_cache;
+
+/* GUC variables */
+char	   *undo_tablespaces = NULL;
+
+static UndoLogControl *get_undo_log(UndoLogNumber logno, bool locked);
+static UndoLogControl *allocate_undo_log(void);
+static void free_undo_log(UndoLogControl *log);
+static void attach_undo_log(UndoPersistence level, Oid tablespace);
+static void detach_current_undo_log(UndoPersistence level, bool full);
+static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end);
+static void undo_log_before_exit(int code, Datum value);
+static void forget_undo_buffers(int logno, UndoLogOffset old_discard,
+								UndoLogOffset new_discard,
+								bool drop_tail);
+static bool choose_undo_tablespace(bool force_detach, Oid *oid);
+static void undolog_xid_map_gc(void);
+
+PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs);
+
+/*
+ * How many undo logs can be active at a time?  This creates a theoretical
+ * maximum transaction size, but if we set it to a multiple of the maximum
+ * number of backends it will be a very high limit.  Alternative designs
+ * involving demand paging or dynamic shared memory could remove this limit
+ * but introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+	return MaxBackends * 4;
+}
+
+/*
+ * Return the amount of traditional shmem required for undo log management.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)
+{
+	return sizeof(UndoLogSharedData) +
+		UndoLogNumSlots() * sizeof(UndoLogControl);
+}
+
+/*
+ * Initialize the undo log subsystem.  Called in each backend.
+ */
+void
+UndoLogShmemInit(void)
+{
+	bool found;
+
+	MyUndoLogState.shared = (UndoLogSharedData *)
+		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
+
+	/* The postmaster initialized the shared memory state. */
+	if (!IsUnderPostmaster)
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		Assert(!found);
+
+		/*
+		 * We start with no active undo logs.  StartUpUndoLogs() will recreate
+		 * the undo logs that were known at the last checkpoint.
+		 */
+		memset(shared, 0, sizeof(*shared));
+		shared->array_size = UndoLogNumSlots();
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+			shared->free_lists[i] = InvalidUndoLogNumber;
+		for (i = 0; i < shared->array_size; ++i)
+		{
+			memset(&shared->logs[i], 0, sizeof(shared->logs[i]));
+			shared->logs[i].logno = InvalidUndoLogNumber;
+			LWLockInitialize(&shared->logs[i].mutex,
+							 LWTRANCHE_UNDOLOG);
+			LWLockInitialize(&shared->logs[i].discard_lock,
+							 LWTRANCHE_UNDODISCARD);
+		}
+	}
+	else
+		Assert(found);
+
+	/* All backends prepare their per-backend lookup table. */
+	undologtable_cache = undologtable_create(TopMemoryContext,
+											 UndoLogNumSlots(),
+											 NULL);
+}
+
+void
+UndoLogInit(void)
+{
+	before_shmem_exit(undo_log_before_exit, 0);
+}
+
+/*
+ * Figure out which directory holds an undo log based on tablespace.
+ */
+static void
+UndoLogDirectory(Oid tablespace, char *dir)
+{
+	if (tablespace == DEFAULTTABLESPACE_OID ||
+		tablespace == InvalidOid)
+		snprintf(dir, MAXPGPATH, "base/undo");
+	else
+		snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo",
+				 tablespace, TABLESPACE_VERSION_DIRECTORY);
+}
+
+/*
+ * Compute the pathname to use for an undo log segment file.
+ */
+void
+UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path)
+{
+	char		dir[MAXPGPATH];
+
+	/* Figure out which directory holds the segment, based on tablespace. */
+	UndoLogDirectory(tablespace, dir);
+
+	/*
+	 * Build the path from log number and offset.  The pathname is the
+	 * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+	 * period inserted between the components.
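+	 * For example, with 1MB segments, segment 2 of undo log 1 in the default
+	 * tablespace is "base/undo/000001.0000200000".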
+	 */
+	snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+			 segno * UndoLogSegmentSize);
+}
+
+/*
+ * Iterate through the set of currently active logs.  Pass in NULL to get the
+ * first undo log.  A NULL return indicates the end of the set.  The caller
+ * must lock the returned log before accessing its members, and must skip if
+ * logno is not valid.
+ */
+UndoLogControl *
+UndoLogNext(UndoLogControl *log)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+	for (;;)
+	{
+		/* Advance to the next log. */
+		if (log == NULL)
+		{
+			/* Start at the beginning. */
+			log = &shared->logs[0];
+		}
+		else if (++log == &shared->logs[shared->array_size])
+		{
+			/* Past the end. */
+			log = NULL;
+			break;
+		}
+		/* Have we found a slot with a valid log? */
+		if (log->logno != InvalidUndoLogNumber)
+			break;
+	}
+	LWLockRelease(UndoLogLock);
+
+	/* XXX: erm, which lock should the caller hold!? */
+	return log;
+}
+
+/*
+ * Check if an undo log position has been discarded.  'point' must be an undo
+ * log pointer that was allocated at some point in the past, otherwise the
+ * result is undefined.
+ */
+bool
+UndoLogIsDiscarded(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log;
+	bool	result;
+
+	log = get_undo_log(logno, false);
+
+	/*
+	 * If we couldn't find the undo log number, then it must be entirely
+	 * discarded.
+	 */
+	if (log == NULL)
+		return true;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (unlikely(logno != log->logno))
+	{
+		/*
+		 * The undo log has been entirely discarded since we looked it up, and
+		 * the UndoLogControl slot is now unused or being used for some other
+		 * undo log.  That means that any pointer within it must be discarded.
+		 */
+		result = true;
+	}
+	else
+	{
+		/* Check if this point is before the discard pointer. */
+		result = UndoRecPtrGetOffset(point) < log->meta.discard;
+	}
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Store the latest transaction's start undo record pointer in the undo
+ * meta-data.  It will be fetched by a backend when it reuses the undo log and
+ * prepares its first undo record.
+ */
+void
+UndoLogSetLastXactStartPoint(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO: review */
+	log->meta.last_xact_start = UndoRecPtrGetOffset(point);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Fetch the previous transaction's start undo record point.
+ */
+UndoRecPtr
+UndoLogGetLastXactStartPoint(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint64 last_xact_start = 0;
+
+	if (unlikely(log == NULL))
+		return InvalidUndoRecPtr;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO: review */
+	last_xact_start = log->meta.last_xact_start;
+	LWLockRelease(&log->mutex);
+
+	if (last_xact_start == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, last_xact_start);
+}
+
+/*
+ * Store the last undo record's length in undo meta-data so that it can be
+ * persistent across restarts.
+ */
+void
+UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO review */
+	log->meta.prevlen = prevlen;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get the last undo record's length.
+ */
+uint16
+UndoLogGetPrevLen(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint16	prevlen;
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO review */
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	return prevlen;
+}
+
+/*
+ * Check whether this record is the first undo record of its transaction.
+ */
+bool
+IsTransactionFirstRec(TransactionId xid)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	Assert(InRecovery);
+
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/* TODO review */
+	return log->meta.is_first_rec;
+}
+
+/*
+ * Detach from the undo log we are currently attached to, returning it to the
+ * appropriate free list if it still has space.
+ */
+static void
+detach_current_undo_log(UndoPersistence persistence, bool full)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+
+	MyUndoLogState.logs[persistence] = NULL;
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = InvalidPid;
+	log->xid = InvalidTransactionId;
+	if (full)
+		log->meta.status = UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	/* Push back onto the appropriate free list, unless it's full. */
+	if (!full)
+	{
+		LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+		log->next_free = shared->free_lists[persistence];
+		shared->free_lists[persistence] = log->logno;
+		LWLockRelease(UndoLogLock);
+	}
+}
+
+/*
+ * Exit handler, detaching from all undo logs.
+ */
+static void
+undo_log_before_exit(int code, Datum arg)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		if (MyUndoLogState.logs[i] != NULL)
+			detach_current_undo_log(i, false);
+	}
+}
+
+/*
+ * Create a new empty segment file on disk for the segment beginning at 'end'.
+ */
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+							UndoLogOffset end)
+{
+	struct stat	stat_buffer;
+	off_t	size;
+	char	path[MAXPGPATH];
+	void   *zeroes;
+	size_t	nzeroes = 8192;
+	int		fd;
+
+	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
+
+	/*
+	 * Create and fully allocate a new file.  If we crashed and recovered
+	 * then the file might already exist, so use flags that tolerate that.
+	 * It's also possible that it exists but is too short, in which case
+	 * we'll write the rest.  We don't really care what's in the file, we
+	 * just want to make sure that the filesystem has allocated physical
+	 * blocks for it, so that non-COW filesystems will report ENOSPC now
+	 * rather than later when the space is needed and we'll avoid creating
+	 * files with holes.
+	 */
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0 && tablespace != 0)
+	{
+		char undo_path[MAXPGPATH];
+
+		/* Try creating the undo directory for this tablespace. */
+		UndoLogDirectory(tablespace, undo_path);
+		if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+		{
+			char	   *parentdir;
+
+			if (errno != ENOENT || !InRecovery)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+
+			/*
+			 * In recovery, it's possible that the tablespace directory
+			 * doesn't exist because a later WAL record removed the whole
+			 * tablespace.  In that case we create a regular directory to
+			 * stand in for it.  This is similar to the logic in
+			 * TablespaceCreateDbspace().
+			 */
+
+			/* Create the parent's parent directory, if it doesn't exist. */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			/* Create the parent directory, if it doesn't exist. */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+		}
+
+		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	}
+	if (fd < 0)
+		elog(ERROR, "could not create new file \"%s\": %m", path);
+	if (fstat(fd, &stat_buffer) < 0)
+		elog(ERROR, "could not stat \"%s\": %m", path);
+	size = stat_buffer.st_size;
+
+	/* A buffer full of zeroes we'll use to fill up new segment files. */
+	zeroes = palloc0(nzeroes);
+
+	while (size < UndoLogSegmentSize)
+	{
+		ssize_t written;
+
+		written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size));
+		if (written < 0)
+			elog(ERROR, "cannot initialize undo log segment file \"%s\": %m",
+				 path);
+		size += written;
+	}
+
+	/* Flush the contents of the file to disk. */
+	if (pg_fsync(fd) != 0)
+		elog(ERROR, "cannot fsync file \"%s\": %m", path);
+	CloseTransientFile(fd);
+
+	pfree(zeroes);
+
+	elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */
+}
+
+/*
+ * Create a new undo segment, when it is unexpectedly not present.
+ */
+void
+UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno)
+{
+	Assert(InRecovery);
+	allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize);
+}
+
+/*
+ * Create and zero-fill new segments as required to extend the given undo log
+ * up to 'new_end'.
+ */
+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+	UndoLogControl *log;
+	char		dir[MAXPGPATH];
+	size_t		end;
+
+	log = get_undo_log(logno, false);
+
+	/* TODO review interlocking */
+
+	Assert(log != NULL);
+	Assert(log->meta.end % UndoLogSegmentSize == 0);
+	Assert(new_end % UndoLogSegmentSize == 0);
+	Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery);
+
+	/*
+	 * Create all the segments needed to increase 'end' to the requested
+	 * size.  This is quite expensive, so we will try to avoid it completely
+	 * by renaming files into place in UndoLogDiscard instead.
+	 */
+	end = log->meta.end;
+	while (end < new_end)
+	{
+		allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	Assert(end == new_end);
+
+	/*
+	 * Flush the parent dir so that the directory metadata survives a crash
+	 * after this point.
+	 */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/*
+	 * If we're not in recovery, we need to WAL-log the creation of the new
+	 * file(s).  We do that after the above filesystem modifications, in
+	 * violation of the data-before-WAL rule as exempted by
+	 * src/backend/access/transam/README.  This means that it's possible for
+	 * us to crash having made some or all of the filesystem changes but
+	 * before WAL logging, but in that case we'll eventually try to create the
+	 * same segment(s) again, which is tolerated.
+	 */
+	if (!InRecovery)
+	{
+		xl_undolog_extend xlrec;
+		XLogRecPtr	ptr;
+
+		xlrec.logno = logno;
+		xlrec.end = end;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+		XLogFlush(ptr);
+	}
+
+	/*
+	 * We didn't need to acquire the mutex to read 'end' above because only
+	 * we write to it.  But we need the mutex to update it, because the
+	 * checkpointer might read it concurrently.
+	 *
+	 * XXX It's possible for meta.end to be higher already during
+	 * recovery, because of the timing of a checkpoint; in that case we did
+	 * nothing above and we shouldn't update shmem here.  That interaction
+	 * needs more analysis.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (log->meta.end < end)
+		log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get an insertion point that is guaranteed to be backed by enough space to
+ * hold 'size' bytes of data.  To actually write into the undo log, client
+ * code should call this first and then use bufmgr routines to access buffers
+ * and provide WAL logs and redo handlers.  In other words, while this module
+ * looks after making sure the undo log has sufficient space and the undo meta
+ * data is crash safe, the *contents* of the undo log and (indirectly) the
+ * insertion point are the responsibility of client code.
+ *
+ * Return an undo log insertion point that can be converted to a buffer tag
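+ * The expected call sequence is roughly: UndoLogAllocate() to reserve space,
+ * then writes through the buffer manager (WAL-logged by the caller), then
+ * UndoLogAdvance() with the returned pointer to move the insert location past
+ * the new data.
+ *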
+ * and an insertion point within a buffer page.
+ *
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+	UndoLogOffset new_insert;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	TransactionId logxid;
+
+	/*
+	 * We may need to attach to an undo log, either because this is the first
+	 * time this backend has needed to write to an undo log at all or because
+	 * the undo_tablespaces GUC was changed.  When doing that, we'll need
+	 * interlocking against tablespaces being concurrently dropped.
+	 */
+
+ retry:
+	/* See if we need to check the undo_tablespaces GUC. */
+	if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL))
+	{
+		Oid		tablespace;
+		bool	need_to_unlock;
+
+		need_to_unlock =
+			choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace,
+								   &tablespace);
+		attach_undo_log(persistence, tablespace);
+		if (need_to_unlock)
+			LWLockRelease(TablespaceCreateLock);
+		log = MyUndoLogState.logs[persistence];
+		log->meta.prevlogno = prevlogno;
+		MyUndoLogState.need_to_choose_tablespace = false;
+	}
+
+	/*
+	 * If this is the first time we've allocated undo log space in this
+	 * transaction, we'll record the xid->undo log association so that it can
+	 * be replayed correctly. Before that, we set the first record flag to
+	 * false.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.is_first_rec = false;
+	logxid = log->xid;
+
+	if (logxid != GetTopTransactionId())
+	{
+		xl_undolog_attach xlrec;
+
+		/*
+		 * While we have the lock, check if we have been forcibly detached by
+		 * DROP TABLESPACE.  That can only happen between transactions (see
+		 * DropUndoLogsInTablespace()).
+		 */
+		if (log->pid == InvalidPid)
+		{
+			LWLockRelease(&log->mutex);
+			log = NULL;
+			goto retry;
+		}
+		log->xid = GetTopTransactionId();
+		log->meta.is_first_rec = true;
+		LWLockRelease(&log->mutex);
+
+		/* Skip the attach record for unlogged and temporary tables. */
+		if (persistence == UNDO_PERMANENT)
+		{
+			xlrec.xid = GetTopTransactionId();
+			xlrec.logno = log->logno;
+			xlrec.dbid = MyDatabaseId;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH);
+		}
+	}
+	else
+	{
+		LWLockRelease(&log->mutex);
+	}
+
+	/*
+	 * 'size' is expressed in usable non-header bytes.  Figure out how far we
+	 * have to move insert to create space for 'size' usable bytes, stepping
+	 * over any intervening headers.
+	 */
+	Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize);
+	new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize);
+
+	/*
+	 * We don't need to acquire log->mutex to read log->meta.insert and
+	 * log->meta.end, because this backend is the only one that can
+	 * modify them.
+	 */
+	if (unlikely(new_insert > log->meta.end))
+	{
+		if (new_insert > UndoLogMaxSize)
+		{
+			/* This undo log is entirely full.  Get a new one. */
+			if (logxid == GetTopTransactionId())
+			{
+				/*
+				 * If the same transaction is split over two undo logs then
+				 * store the previous log number in the new log.  See the
+				 * detailed comments in the undorecord.c file header.
+				 */
+				prevlogno = log->logno;
+			}
+			elog(LOG, "undo log %u is full, switching to a new one", log->logno);
+			log = NULL;
+			detach_current_undo_log(persistence, true);
+			goto retry;
+		}
+		/*
+		 * Extend the end of this undo log to cover new_insert (in other words
+		 * round up to the segment size).
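+		 * For example, with 1MB segments, a new_insert of 0x1500FF results in
+		 * an 'end' of 0x200000.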
+		 */
+		extend_undo_log(log->logno,
+						new_insert + UndoLogSegmentSize -
+						new_insert % UndoLogSegmentSize);
+		Assert(new_insert <= log->meta.end);
+	}
+
+	return MakeUndoRecPtr(log->logno, log->meta.insert);
+}
+
+/*
+ * In recovery, we expect the xid to map to a known log which already has
+ * enough space in it.
+ */
+UndoRecPtr
+UndoLogAllocateInRecovery(TransactionId xid, size_t size,
+						  UndoPersistence level)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	/*
+	 * The sequence of calls to UndoLogAllocateInRecovery() during REDO
+	 * (recovery) must match the sequence of calls to UndoLogAllocate during
+	 * DO, for any given session.  The XXX_redo code for any UNDO-generating
+	 * operation must use UndoLogAllocateInRecovery() rather than
+	 * UndoLogAllocate(), because it must supply the extra 'xid' argument so
+	 * that we can find out which undo log number to use.  During DO, that's
+	 * tracked per-backend, but during REDO the original backends/sessions are
+	 * lost and we have only the Xids.
+	 */
+	Assert(InRecovery);
+
+	/*
+	 * Look up the undo log number for this xid.  The mapping must already
+	 * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the
+	 * first call to UndoLogAllocate for this xid after the most recent
+	 * checkpoint.
+	 */
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	if (logno == InvalidUndoLogNumber)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	/*
+	 * This log must already have been created by an XLOG_UNDOLOG_CREATE
+	 * record emitted by UndoLogAllocate().
+	 */
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/*
+	 * This log must already have been extended to cover the requested size by
+	 * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate(), or by
+	 * XLOG_UNDOLOG_DISCARD records recycling segments.
+	 */
+	if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size))
+		elog(ERROR,
+			 "unexpectedly couldn't allocate %zu bytes in undo log number %d",
+			 size, logno);
+
+	/*
+	 * By this time we have allocated an undo log for this transaction, so any
+	 * record allocated after this point is not the first undo record of the
+	 * transaction.
+	 */
+	log->meta.is_first_rec = false;
+
+	return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = NULL;
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insertion_point) ;
+
+	/*
+	 * During recovery, MyUndoLogState.logs[] is not maintained, so we have to
+	 * look the undo log up by number instead.
+	 */
+	log = (InRecovery) ? get_undo_log(logno, false)
+		: MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+	Assert(InRecovery || logno == log->logno);
+	Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions.  The passed in undo pointer is
+ * the address of the oldest data that the caller would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_pointer).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment.  This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk.  Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted.  It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+	UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+	UndoLogOffset old_discard;
+	UndoLogOffset end;
+	UndoLogControl *log;
+	int			segno;
+	int			new_segno;
+	bool		need_to_flush_wal = false;
+	bool		entirely_discarded = false;
+
+	log = get_undo_log(logno, false);
+	if (unlikely(log == NULL))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is already entirely discarded",
+			 logno);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (unlikely(log->logno != logno))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is entirely discarded",
+			 logno);
+	if (discard > log->meta.insert)
+		elog(ERROR, "cannot move discard point past insert point");
+	old_discard = log->meta.discard;
+	if (discard < old_discard)
+		elog(ERROR, "cannot move discard pointer backwards");
+	end = log->meta.end;
+	/* Are we discarding the last remaining data in a log marked as full? */
+	if (log->meta.status == UNDO_LOG_STATUS_FULL &&
+		discard == log->meta.insert)
+	{
+		/*
+		 * Adjust the discard and insert pointers so that the final segment is
+		 * deleted from disk, and remember not to recycle it.
+		 */
+		entirely_discarded = true;
+		log->meta.insert = log->meta.end;
+		discard = log->meta.end;
+	}
+	LWLockRelease(&log->mutex);
+
+	/*
+	 * Drop all buffers holding this undo data out of the buffer pool (except
+	 * the last one, if the new location is in the middle of it somewhere), so
+	 * that the contained data doesn't ever touch the disk.  The caller
+	 * promises that this data will not be needed again.  We have to drop the
+	 * buffers from the buffer pool before removing files, otherwise a
+	 * concurrent session might try to write the block to evict the buffer.
+	 */
+	forget_undo_buffers(logno, old_discard, discard, entirely_discarded);
+
+	/*
+	 * Check if we crossed a segment boundary and need to do some synchronous
+	 * filesystem operations.
+	 */
+	segno = old_discard / UndoLogSegmentSize;
+	new_segno = discard / UndoLogSegmentSize;
+	if (segno < new_segno)
+	{
+		int		recycle;
+		UndoLogOffset pointer;
+
+		/*
+		 * We always WAL-log discards, but we only need to flush the WAL if we
+		 * have performed a filesystem operation.
+		 */
+		need_to_flush_wal = true;
+
+		/*
+		 * XXX When we rename or unlink a file, it's possible that some
+		 * backend still has it open because it has recently read a page from
+		 * it.  smgr/undofile.c in any such backend will eventually close it,
+		 * because it considers that fd to belong to the file with the name
+		 * that we're unlinking or renaming and it doesn't like to keep more
+		 * than one open at a time.  No backend should ever try to read from
+		 * such a file descriptor; that is what it means when we say that the
+		 * caller of UndoLogDiscard() asserts that there will be no attempts
+		 * to access the discarded range of undo log.  In the case of a
+		 * rename, if a backend were to attempt to read undo data in the range
+		 * being discarded, it would read entirely the wrong data.
+		 */
+
+		/*
+		 * How many segments should we recycle (= rename from tail position to
+		 * head position)?  For now it's always 1 unless there is already a
+		 * spare one, but we could have an adaptive algorithm that recycles
+		 * multiple segments at a time and pays just one fsync().
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+		if ((log->meta.end - log->meta.insert) < UndoLogSegmentSize &&
+			log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+			recycle = 1;
+		else
+			recycle = 0;
+		LWLockRelease(&log->mutex);
+
+		/* Rewind to the start of the segment. */
+		pointer = segno * UndoLogSegmentSize;
+
+		while (pointer < new_segno * UndoLogSegmentSize)
+		{
+			char	discard_path[MAXPGPATH];
+
+			/*
+			 * Before removing the file, make sure that undofile_sync knows
+			 * that it might be missing.
+			 */
+			undofile_forgetsync(log->logno,
+								log->meta.tablespace,
+								pointer / UndoLogSegmentSize);
+
+			UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize,
+							   log->meta.tablespace, discard_path);
+
+			/* Can we recycle the oldest segment? */
+			if (recycle > 0)
+			{
+				char	recycle_path[MAXPGPATH];
+
+				/*
+				 * End points one byte past the end of the current undo space,
+				 * ie to the first byte of the segment file we want to create.
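+				 * For example, with 1MB segments, "000001.0000000000" might be
+				 * renamed to "000001.0000300000" if 'end' is currently at
+				 * 0x300000.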
+				 */
+				UndoLogSegmentPath(logno, end / UndoLogSegmentSize,
+								   log->meta.tablespace, recycle_path);
+				if (rename(discard_path, recycle_path) == 0)
+				{
+					elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+					end += UndoLogSegmentSize;
+					--recycle;
+				}
+				else
+				{
+					elog(ERROR, "could not rename \"%s\" to \"%s\": %m",
+						 discard_path, recycle_path);
+				}
+			}
+			else
+			{
+				if (unlink(discard_path) == 0)
+					elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+				else
+					elog(ERROR, "could not unlink \"%s\": %m", discard_path);
+			}
+			pointer += UndoLogSegmentSize;
+		}
+	}
+
+	/* WAL log the discard. */
+	{
+		xl_undolog_discard xlrec;
+		XLogRecPtr ptr;
+
+		xlrec.logno = logno;
+		xlrec.discard = discard;
+		xlrec.end = end;
+		xlrec.latestxid = xid;
+		xlrec.entirely_discarded = entirely_discarded;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
+
+		if (need_to_flush_wal)
+			XLogFlush(ptr);
+	}
+
+	/* Update shmem to show the new discard and end pointers. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * Return an UndoRecPtr to the oldest valid data in an undo log, or
+ * InvalidUndoRecPtr if it is empty.
+ */
+UndoRecPtr
+UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full)
+{
+	UndoRecPtr	result;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (log->meta.discard == log->meta.insert)
+		result = InvalidUndoRecPtr;
+	else
+		result = MakeUndoRecPtr(log->logno, log->meta.discard);
+	*full = log->meta.status == UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Return the next insert location.  This also validates the input xid: if the
+ * latest insert point is not for the same transaction id, this returns
+ * InvalidUndoRecPtr.
+ */
+UndoRecPtr
+UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId	logxid;
+	UndoRecPtr	insert;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert);
+}
+
+/*
+ * Get the address of the most recently inserted record.
+ */
+UndoRecPtr
+UndoLogGetLastRecordPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId logxid;
+	UndoRecPtr insert;
+	uint16 prevlen;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) &&
+		TransactionIdIsValid(xid) &&
+		!TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	if (prevlen == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert - prevlen);
+}
+
+/*
+ * Rewind the undo log insert position, and also set prevlen in the meta-data.
+ */
+void
+UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen)
+{
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insert_urp);
+	UndoLogControl *log = get_undo_log(logno, false);
+	UndoLogOffset	insert = UndoRecPtrGetOffset(insert_urp);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = insert;
+	log->meta.prevlen = prevlen;
+
+	/*
+	 * Force an attach WAL record to be written on the next undo allocation, so
+	 * that during recovery the undo insert location is consistent with normal
+	 * allocation.
+	 */
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	/* WAL log the rewind. */
+	{
+		xl_undolog_rewind xlrec;
+
+		xlrec.logno = logno;
+		xlrec.insert = insert;
+		xlrec.prevlen = prevlen;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND);
+	}
+}
+
+/*
+ * Delete unreachable files under pg_undo.  Any files corresponding to LSN
+ * positions before the previous checkpoint are no longer needed.
+ */
+static void
+CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo)
+{
+	DIR	   *dir;
+	struct dirent *de;
+	char	path[MAXPGPATH];
+	char	oldest_path[MAXPGPATH];
+
+	/*
+	 * If a base backup is in progress, we can't delete any checkpoint
+	 * snapshot files because one of them corresponds to the backup label but
+	 * there could be any number of checkpoints during the backup.
+	 */
+	if (BackupInProgress())
+		return;
+
+	/* Otherwise keep only those >= the previous checkpoint's redo point. */
+	snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	dir = AllocateDir("pg_undo");
+	while ((de = ReadDir(dir, "pg_undo")) != NULL)
+	{
+		/*
+		 * Assume that fixed width uppercase hex strings sort the same way as
+		 * the values they represent, so we can use strcmp to identify undo
+		 * log snapshot files corresponding to checkpoints that we don't need
+		 * anymore.  This assumption holds for ASCII.
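+		 * For example, "00000000016B3B60" precedes "00000000025AE000" both as
+		 * a string and as a value.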
+		 */
+		if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH))
+			continue;
+
+		if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path))
+		{
+			snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name);
+			if (unlink(path) != 0)
+				elog(ERROR, "could not unlink file \"%s\": %m", path);
+		}
+	}
+	FreeDir(dir);
+}
+
+/*
+ * Write out the undo log meta data to the pg_undo directory.  The actual
+ * contents of undo logs are in shared buffers and therefore handled by
+ * CheckPointBuffers(), but here we record the table of undo logs and their
+ * properties.
+ */
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogMetaData *serialized = NULL;
+	size_t	serialized_size = 0;
+	char   *data;
+	char	path[MAXPGPATH];
+	int		num_logs;
+	int		fd;
+	int		i;
+	pg_crc32c crc;
+
+	/*
+	 * We acquire UndoLogLock to prevent any undo logs from being created or
+	 * discarded while we build a snapshot of them.  This isn't expected to
+	 * take long on a healthy system because the number of active logs should
+	 * be around the number of backends.  Holding this lock won't prevent
+	 * concurrent access to the undo log, except when segments need to be
+	 * added or removed.
+	 */
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+
+	/*
+	 * Rather than doing the file IO while we hold locks, we'll copy the
+	 * meta-data into a palloc'd buffer.
+	 */
+	serialized_size = sizeof(UndoLogMetaData) * UndoLogNumSlots();
+	serialized = (UndoLogMetaData *) palloc0(serialized_size);
+
+	/* Scan through all slots looking for non-empty ones. */
+	num_logs = 0;
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		UndoLogControl *slot = &shared->logs[i];
+
+		/* Skip empty slots. */
+		if (slot->logno == InvalidUndoLogNumber)
+			continue;
+
+		/* Capture snapshot while holding each mutex. */
+		LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+		serialized[num_logs++] = slot->meta;
+		slot->need_attach_wal_record = true; /* XXX: ?!? */
+		LWLockRelease(&slot->mutex);
+	}
+
+	LWLockRelease(UndoLogLock);
+
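+	/*
+	 * The snapshot file is named after the checkpoint redo LSN, and its layout
+	 * matches what we write below: low_logno, next_logno and the number of
+	 * active logs followed by a CRC of those, then the UndoLogMetaData structs
+	 * for the active logs followed by a CRC of that data.
+	 */
+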
+	/* Dump into a file under pg_undo. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", path)));
+
+	/* Compute header checksum. */
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+	FIN_CRC32C(crc);
+
+	/* Write out the number of active logs + crc. */
+	if ((write(fd, &shared->low_logno, sizeof(shared->low_logno)) != sizeof(shared->low_logno)) ||
+		(write(fd, &shared->next_logno, sizeof(shared->next_logno)) != sizeof(shared->next_logno)) ||
+		(write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
+		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+	/* Write out the meta data for all active undo logs. */
+	data = (char *) serialized;
+	INIT_CRC32C(crc);
+	serialized_size = num_logs * sizeof(UndoLogMetaData);
+	while (serialized_size > 0)
+	{
+		ssize_t written;
+
+		written = write(fd, data, serialized_size);
+		if (written < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m", path)));
+		COMP_CRC32C(crc, data, written);
+		serialized_size -= written;
+		data += written;
+	}
+	FIN_CRC32C(crc);
+
+	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+	/* Flush file and directory entry. */
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
+	if (pg_fsync(fd) != 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not fsync file \"%s\": %m", path)));
+	CloseTransientFile(fd);
+	fsync_fname("pg_undo", true);
+	pgstat_report_wait_end();
+
+	if (serialized)
+		pfree(serialized);
+
+	CleanUpUndoCheckPointFiles(priorCheckPointRedo);
+	undolog_xid_map_gc();
+}
+
+void
+StartupUndoLogs(XLogRecPtr checkPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char	path[MAXPGPATH];
+	int		i;
+	int		fd;
+	int		nlogs;
+	pg_crc32c crc;
+	pg_crc32c new_crc;
+
+	/* If initdb is calling, there is no file to read yet. */
+	if (IsBootstrapProcessingMode())
+		return;
+
+	/* Open the pg_undo file corresponding to the given checkpoint. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
+
+	/* Read the active log number range. */
+	if ((read(fd, &shared->low_logno, sizeof(shared->low_logno))
+		 != sizeof(shared->low_logno)) ||
+		(read(fd, &shared->next_logno, sizeof(shared->next_logno))
+		 != sizeof(shared->next_logno)) ||
+		(read(fd, &nlogs, sizeof(nlogs)) != sizeof(nlogs)) ||
+		(read(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+
+	/* Verify the header checksum. */
+	INIT_CRC32C(new_crc);
+	COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(new_crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(new_crc, &nlogs, sizeof(nlogs));
+	FIN_CRC32C(new_crc);
+
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	/*
+	 * We'll acquire UndoLogLock just because allocate_undo_log() asserts we
+	 * hold it (we don't actually expect concurrent access yet).
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/* Initialize all the logs and set up the freelist. */
+	INIT_CRC32C(new_crc);
+	for (i = 0; i < nlogs; ++i)
+	{
+		ssize_t size;
+		UndoLogControl *log;
+
+		/*
+		 * Get a new slot to hold this UndoLogControl object.  If this
+		 * checkpoint was created on a system with a higher max_connections
+		 * setting, it's theoretically possible that we don't have enough
+		 * space and cannot start up.
+		 */
+		log = allocate_undo_log();
+		if (!log)
+			ereport(ERROR,
+					(errmsg("not enough undo log slots to recover from checkpoint: need at least %d, have %zu",
+							nlogs, UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections")));
+
+		/* Read in the meta data for this undo log. */
+		if ((size = read(fd, &log->meta, sizeof(log->meta))) != sizeof(log->meta))
+			elog(ERROR, "short read of pg_undo meta data in file \"%s\": %m (got %zu, wanted %zu)",
+				 path, size, sizeof(log->meta));
+		COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta));
+
+		/*
+		 * At normal start-up, or during recovery, all active undo logs start
+		 * out on the appropriate free list.
+		 */
+		log->logno = log->meta.logno;
+		log->pid = InvalidPid;
+		log->xid = InvalidTransactionId;
+		if (log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+		{
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+		}
+	}
+	FIN_CRC32C(new_crc);
+
+	LWLockRelease(UndoLogLock);
+
+	/* Verify body checksum. */
+	if (read(fd, &crc, sizeof(crc)) != sizeof(crc))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	CloseTransientFile(fd);
+	pgstat_report_wait_end();
+}
+
+/*
+ * Return a pointer to a newly allocated UndoLogControl object in shared
+ * memory, or return NULL if there are no free slots.  The caller should
+ * acquire the mutex and set up the object.
+ */
+static UndoLogControl *
+allocate_undo_log(void)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMeInMode(UndoLogLock, LW_EXCLUSIVE));
+
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		log = &shared->logs[i];
+		if (log->logno == InvalidUndoLogNumber)
+		{
+			memset(&log->meta, 0, sizeof(log->meta));
+			log->next_free = InvalidUndoLogNumber;
+			/* TODO: oldest_xid etc? */
+			return log;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an UndoLogControl object in shared memory, so that it can be reused.
+ */
+static void
+free_undo_log(UndoLogControl *log)
+{
+	/*
+	 * When removing an undo log from a slot in shared memory, we acquire
+	 * UndoLogLock and log->mutex, so that other code can hold either lock to
+	 * prevent the object from disappearing.
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno != InvalidUndoLogNumber);
+	log->logno = InvalidUndoLogNumber;
+	memset(&log->meta, 0, sizeof(log->meta));
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * Get the UndoLogControl object for a given log number.
+ *
+ * The caller may or may not already hold UndoLogLock, and should indicate
+ * this by passing 'locked'.  We'll acquire it in the slow path if necessary.
+ * Either way, the caller must deal with the possibility that the returned
+ * UndoLogControl object pointed to no longer contains the requested logno by
+ * the time it is accessed.
+ *
+ * To do that, one of the following approaches must be taken by the calling
+ * code:
+ *
+ * 1.  If it is known that the calling backend is attached to the log, then it
+ * can be assumed that the UndoLogControl slot still holds the same undo log
+ * number.  The UndoLogControl slot can only change with the cooperation of
+ * the undo log that is attached to it (it must first be marked as
+ * UNDO_LOG_STATUS_FULL, which happens when a backend detaches).  Calling
+ * code should probably assert that it is attached and the logno is as
+ * expected, however.
+ *
+ * 2.  Acquire log->mutex before accessing any members, and after doing so,
+ * check that the logno is as expected.  If it is not, the entire undo log
+ * must be assumed to be discarded and the caller must behave accordingly.
+ *
+ * Return NULL if the undo log has been entirely discarded.  It is an error to
+ * ask for undo logs that have never been created.
+ */
+static UndoLogControl *
+get_undo_log(UndoLogNumber logno, bool locked)
+{
+	UndoLogControl *result = NULL;
+	UndoLogTableEntry *entry;
+	bool	   found;
+
+	Assert(locked == LWLockHeldByMe(UndoLogLock));
+
+	/* First see if we already have it in our cache. */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		result = entry->control;
+	else
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		/* Nope.  Linear search for the slot in shared memory. */
+		if (!locked)
+			LWLockAcquire(UndoLogLock, LW_SHARED);
+		for (i = 0; i < UndoLogNumSlots(); ++i)
+		{
+			if (shared->logs[i].logno == logno)
+			{
+				/* Found it. */
+
+				/*
+				 * TODO: Should this function be usable in a critical section?
+				 * Would it make sense to detect that we are in a critical
+				 * section and just return the pointer to the log without
+				 * updating the cache, to avoid any chance of allocating
+				 * memory?
+				 */
+
+				entry = undologtable_insert(undologtable_cache, logno, &found);
+				entry->number = logno;
+				entry->control = &shared->logs[i];
+				entry->tablespace = entry->control->meta.tablespace;
+				result = entry->control;
+				break;
+			}
+		}
+
+		/*
+		 * If we didn't find it, then it must already have been entirely
+		 * discarded.  We create a negative cache entry so that we can answer
+		 * this question quickly next time.
+		 *
+		 * TODO: We could track the lowest known undo log number, to reduce
+		 * the negative cache entry bloat.
+		 */
+		if (result == NULL)
+		{
+			/*
+			 * Sanity check: the caller should not be asking about undo logs
+			 * that have never existed.
+			 */
+			if (logno >= shared->next_logno)
+				elog(PANIC, "undo log %u hasn't been created yet", logno);
+			entry = undologtable_insert(undologtable_cache, logno, &found);
+			entry->number = logno;
+			entry->control = NULL;
+			entry->tablespace = 0;
+		}
+		if (!locked)
+			LWLockRelease(UndoLogLock);
+	}
+
+	return result;
+}
+
+/*
+ * Get a pointer to an UndoLogControl object corresponding to a given logno.
+ *
+ * In general, the caller must acquire the UndoLogControl's mutex to access
+ * the contents, and at that time must consider that the logno might have
+ * changed because the undo log it contained has been entirely discarded.
+ *
+ * If the calling backend is currently attached to the undo log, that is not
+ * possible, because logs can only reach UNDO_LOG_STATUS_DISCARDED after first
+ * reaching UNDO_LOG_STATUS_FULL, and that only happens while detaching.
+ */
+UndoLogControl *
+UndoLogGet(UndoLogNumber logno, bool missing_ok)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	if (log == NULL && !missing_ok)
+		elog(ERROR, "unknown undo log number %d", logno);
+
+	return log;
+}
+
+/*
+ * Attach to an undo log, possibly creating or recycling one as required.
+ */
+static void
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = NULL;
+	UndoLogNumber logno;
+	UndoLogNumber *place;
+
+	Assert(!InRecovery);
+	Assert(MyUndoLogState.logs[persistence] == NULL);
+
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/*
+	 * For now we have a simple linked list of unattached undo logs for each
+	 * persistence level.  We'll grovel though it to find something for the
+	 * tablespace you asked for.  If you're not using multiple tablespaces
+	 * it'll be able to pop one off the front.  We might need a hash table
+	 * keyed by tablespace if this simple scheme turns out to be too slow when
+	 * using many tablespaces and many undo logs, but that seems like an
+	 * unusual use case not worth optimizing for.
+	 */
+	place = &shared->free_lists[persistence];
+	while (*place != InvalidUndoLogNumber)
+	{
+		UndoLogControl *candidate = get_undo_log(*place, true);
+
+		/*
+		 * There should never be an undo log on the freelist that has been
+		 * entirely discarded, or hasn't been created yet.  The persistence
+		 * level should match the freelist.
+		 */
+		if (unlikely(candidate == NULL))
+			elog(ERROR,
+				 "corrupted undo log freelist, no such undo log %u", *place);
+		if (unlikely(candidate->meta.persistence != persistence))
+			elog(ERROR,
+				 "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
+				 *place, candidate->meta.persistence, persistence);
+
+		if (candidate->meta.tablespace == tablespace)
+		{
+			logno = *place;
+			log = candidate;
+			*place = candidate->next_free;
+			break;
+		}
+		place = &candidate->next_free;
+	}
+
+	/*
+	 * All existing undo logs for this tablespace and persistence level are
+	 * busy, so we'll have to create a new one.
+	 */
+	if (log == NULL)
+	{
+		if (shared->next_logno > MaxUndoLogNumber)
+		{
+			/*
+			 * You've used up all 16 exabytes of undo log addressing space.
+			 * This is a difficult state to reach using only 16 exabytes of
+			 * WAL.
+			 */
+			elog(ERROR, "undo log address space exhausted");
+		}
+
+		/* Allocate a slot from the UndoLogControl pool. */
+		log = allocate_undo_log();
+		if (unlikely(!log))
+			ereport(ERROR,
+					(errmsg("could not create new undo log"),
+					 errdetail("The maximum number of active undo logs is %zu.",
+							   UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections.")));
+		log->logno = logno = shared->next_logno;
+
+		/*
+		 * The insert and discard pointers start after the first block's
+		 * header.  XXX That means that insert is > end for a short time in a
+		 * newly created undo log.  Is there any problem with that?
+		 */
+		log->meta.insert = UndoLogBlockHeaderSize;
+		log->meta.discard = UndoLogBlockHeaderSize;
+
+		log->meta.logno = logno;
+		log->meta.tablespace = tablespace;
+		log->meta.persistence = persistence;
+		log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+
+		/* Move the high log number pointer past this one. */
+		++shared->next_logno;
+
+		/* WAL-log the creation of this new undo log. */
+		{
+			xl_undolog_create xlrec;
+
+			xlrec.logno = logno;
+			xlrec.tablespace = log->meta.tablespace;
+			xlrec.persistence = log->meta.persistence;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+		}
+
+		/*
+		 * This undo log has no segments.  UndoLogAllocate will create the
+		 * first one on demand.
+		 */
+	}
+	LWLockRelease(UndoLogLock);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = MyProcPid;
+	log->xid = InvalidTransactionId;
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	MyUndoLogState.logs[persistence] = log;
+}
+
+/*
+ * Free chunks of the xid/undo log map that relate to transactions that are no
+ * longer running.  This is run at each checkpoint.
+ */
+static void
+undolog_xid_map_gc(void)
+{
+	UndoLogNumber **xid_map = MyUndoLogState.xid_map;
+	TransactionId oldest_xid;
+	uint16 new_oldest_chunk;
+	uint16 oldest_chunk;
+
+	if (xid_map == NULL)
+		return;
+
+	/*
+	 * During crash recovery, it may not be possible to call GetOldestXmin()
+	 * yet because latestCompletedXid is invalid.
+	 */
+	if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid))
+		return;
+
+	oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	new_oldest_chunk = UndoLogGetXidHigh(oldest_xid);
+	oldest_chunk = MyUndoLogState.xid_map_oldest_chunk;
+
+	while (oldest_chunk != new_oldest_chunk)
+	{
+		if (xid_map[oldest_chunk])
+		{
+			pfree(xid_map[oldest_chunk]);
+			xid_map[oldest_chunk] = NULL;
+		}
+		oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits);
+	}
+	MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk;
+}
+
+/*
+ * Associate a xid with an undo log, during recovery.  In a primary server,
+ * this isn't necessary because backends know which undo log they're attached
+ * to.  During recovery, the natural association between backends and xids is
+ * lost, so we need to manage that explicitly.
+ */
+static void
+undolog_xid_map_add(TransactionId xid, UndoLogNumber logno)
+{
+	uint16		high_bits;
+	uint16		low_bits;
+
+	high_bits = UndoLogGetXidHigh(xid);
+	low_bits = UndoLogGetXidLow(xid);
+
+	if (unlikely(MyUndoLogState.xid_map == NULL))
+	{
+		/* First time through.  Create mapping array. */
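+		/*
+		 * With UndoLogXidLowBits = 16, the top-level array below has 2^16
+		 * chunk pointers; each chunk allocated on demand further down holds
+		 * 2^16 undo log numbers.
+		 */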
+		MyUndoLogState.xid_map =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber *) *
+								   (1 << (32 - UndoLogXidLowBits)));
+		MyUndoLogState.xid_map_oldest_chunk = high_bits;
+	}
+
+	if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL))
+	{
+		/* This bank of mappings doesn't exist yet.  Create it. */
+		MyUndoLogState.xid_map[high_bits] =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber) *
+								   (1 << UndoLogXidLowBits));
+	}
+
+	/* Associate this xid with this undo log number. */
+	MyUndoLogState.xid_map[high_bits][low_bits] = logno;
+}
+
+/* check_hook: validate new undo_tablespaces */
+bool
+check_undo_tablespaces(char **newval, void **extra, GucSource source)
+{
+	char	   *rawname;
+	List	   *namelist;
+
+	/* Need a modifiable copy of string */
+	rawname = pstrdup(*newval);
+
+	/*
+	 * Parse string into list of identifiers, just to check for
+	 * well-formedness (unfortunately we can't validate the names in the
+	 * catalog yet).
+	 */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+	{
+		/* syntax error in name list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawname);
+		list_free(namelist);
+		return false;
+	}
+
+	/*
+	 * Make sure we aren't already in a transaction that has been assigned an
+	 * XID.  This ensures we don't detach from an undo log that we might have
+	 * started writing undo data into for this transaction.
+	 */
+	if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+	list_free(namelist);
+
+	return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+	/*
+	 * This is normally called only when GetTopTransactionIdIfAny() ==
+	 * InvalidTransactionId (because you can't change undo_tablespaces in the
+	 * middle of a transaction that's been asigned an xid), but we can't
+	 * assert that because it's also called at the end of a transaction that's
+	 * rolling back, to reset the GUC if it was set inside the transaction.
+	 */
+
+	/* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+	MyUndoLogState.need_to_choose_tablespace = true;
+}
+
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+	char   *rawname;
+	List   *namelist;
+	bool	need_to_unlock;
+	int		length;
+	int		i;
+
+	/* We need a modifiable copy of string. */
+	rawname = pstrdup(undo_tablespaces);
+
+	/* Break string into list of identifiers. */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+		elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+	length = list_length(namelist);
+	if (length == 0 ||
+		(length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+	{
+		/*
+		 * If it's an empty string, then we'll use the default tablespace.  No
+		 * locking is required because it can't be dropped.
+		 */
+		*tablespace = DEFAULTTABLESPACE_OID;
+		need_to_unlock = false;
+	}
+	else
+	{
+		/*
+		 * Choose an OID using our pid, so that if several backends have the
+		 * same multi-tablespace setting they'll spread out.  We could easily
+		 * do better than this if more serious load balancing is judged
+		 * useful.
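+		 *
+		 * For example, if undo_tablespaces lists two names (say, the
+		 * hypothetical tablespaces "undo1,undo2"), a backend with an even
+		 * pid starts at index 0 and one with an odd pid starts at index 1,
+		 * moving on to the next entry only if the chosen name cannot be
+		 * resolved.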
+		 */
+		int		index = MyProcPid % length;
+		int		first_index = index;
+		Oid		oid = InvalidOid;
+
+		/*
+		 * Take the tablespace create/drop lock while we look the name up.
+		 * This prevents the tablespace from being dropped while we're trying
+		 * to resolve the name, or while the caller is trying to create an
+		 * undo log in it.  The caller will have to release this lock.
+		 */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		for (;;)
+		{
+			const char *name = list_nth(namelist, index);
+
+			oid = get_tablespace_oid(name, true);
+			if (oid == InvalidOid)
+			{
+				/* Unknown tablespace, try the next one. */
+				index = (index + 1) % length;
+				/*
+				 * But if we've tried them all, it's time to complain.  We'll
+				 * arbitrarily complain about the last one we tried in the
+				 * error message.
+				 */
+				if (index == first_index)
+					ereport(ERROR,
+							(errcode(ERRCODE_UNDEFINED_OBJECT),
+							 errmsg("tablespace \"%s\" does not exist", name),
+							 errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+				continue;
+			}
+			if (oid == GLOBALTABLESPACE_OID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("undo logs cannot be placed in pg_global tablespace")));
+			/* If we got here we succeeded in finding one. */
+			break;
+		}
+
+		Assert(oid != InvalidOid);
+		*tablespace = oid;
+		need_to_unlock = true;
+	}
+
+	/*
+	 * If we came here because the user changed undo_tablespaces, then detach
+	 * from any undo logs we happen to be attached to.
+	 */
+	if (force_detach)
+	{
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+		{
+			UndoLogControl *log = MyUndoLogState.logs[i];
+			UndoLogSharedData *shared = MyUndoLogState.shared;
+
+			if (log != NULL)
+			{
+				LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+				log->pid = InvalidPid;
+				log->xid = InvalidTransactionId;
+				LWLockRelease(&log->mutex);
+
+				LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+				log->next_free = shared->free_lists[i];
+				shared->free_lists[i] = log->logno;
+				LWLockRelease(UndoLogLock);
+
+				MyUndoLogState.logs[i] = NULL;
+			}
+		}
+	}
+
+	return need_to_unlock;
+}
+
+bool
+DropUndoLogsInTablespace(Oid tablespace)
+{
+	DIR *dir;
+	char undo_path[MAXPGPATH];
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMe(TablespaceCreateLock));
+	Assert(tablespace != DEFAULTTABLESPACE_OID);
+
+	/* First, try to kick everyone off any undo logs in this tablespace. */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		bool ok;
+		bool return_to_freelist = false;
+
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/* Check if this undo log can be forcibly detached. */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		if (log->meta.discard == log->meta.insert &&
+			(log->xid == InvalidTransactionId ||
+			 !TransactionIdIsInProgress(log->xid)))
+		{
+			log->xid = InvalidTransactionId;
+			if (log->pid != InvalidPid)
+			{
+				log->pid = InvalidPid;
+				return_to_freelist = true;
+			}
+			ok = true;
+		}
+		else
+		{
+			/*
+			 * There is data we need in this undo log.  We can't force it to
+			 * be detached.
+			 */
+			ok = false;
+		}
+		LWLockRelease(&log->mutex);
+
+		/* If we failed, then give up now and report failure. */
+		if (!ok)
+			return false;
+
+		/*
+		 * Put this undo log back on the appropriate free-list.  No one can
+		 * attach to it while we hold TablespaceCreateLock, but if we return
+		 * early in a later iteration of this loop, we need the undo log to
+		 * remain usable.  We'll remove all appropriate logs from the
+		 * free-lists in a separate step below.
+		 */
+		if (return_to_freelist)
+		{
+			LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+			LWLockRelease(UndoLogLock);
+		}
+	}
+
+	/*
+	 * We detached all backends from undo logs in this tablespace, and no one
+	 * can attach to any non-default-tablespace undo logs while we hold
+	 * TablespaceCreateLock.  We can now drop the undo logs.
+	 */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/*
+		 * Make sure no buffers remain.  When that is done by UndoDiscard(),
+		 * the final page is left in shared_buffers because it may contain
+		 * data, or at least be needed again very soon.  Here we need to drop
+		 * even that page from the buffer pool.
+		 */
+		forget_undo_buffers(log->logno, log->meta.discard, log->meta.discard, true);
+
+		/*
+		 * TODO: For now we drop the undo log, meaning that it will never be
+		 * used again.  That wastes the rest of its address space.  Instead,
+		 * we should put it onto a special list of 'offline' undo logs, ready
+		 * to be reactivated in some other tablespace.  Then we can keep the
+		 * unused portion of its address space.
+		 */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		log->meta.status = UNDO_LOG_STATUS_DISCARDED;
+		LWLockRelease(&log->mutex);
+	}
+
+	/* Unlink all undo segment files in this tablespace. */
+	UndoLogDirectory(tablespace, undo_path);
+
+	dir = AllocateDir(undo_path);
+	if (dir != NULL)
+	{
+		struct dirent *de;
+
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 ||
+				strcmp(de->d_name, "..") == 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			if (unlink(segment_path) < 0)
+				elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+	}
+
+	/* Remove all dropped undo logs from the free-lists. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		UndoLogControl *log;
+		UndoLogNumber *place;
+
+		place = &shared->free_lists[i];
+		while (*place != InvalidUndoLogNumber)
+		{
+			log = get_undo_log(*place, true);
+			if (!log)
+				elog(ERROR,
+					 "corrupted undo log freelist, unknown log %u", *place);
+			if (log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+				*place = log->next_free;
+			else
+				place = &log->next_free;
+		}
+	}
+	LWLockRelease(UndoLogLock);
+
+	return true;
+}
+
+void
+ResetUndoLogs(UndoPersistence persistence)
+{
+	UndoLogControl *log;
+
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		DIR	   *dir;
+		struct dirent *de;
+		char	undo_path[MAXPGPATH];
+		char	segment_prefix[MAXPGPATH];
+		size_t	segment_prefix_size;
+
+		if (log->meta.persistence != persistence)
+			continue;
+
+		/* Scan the directory for files belonging to this undo log. */
+		snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", log->logno);
+		segment_prefix_size = strlen(segment_prefix);
+		UndoLogDirectory(log->meta.tablespace, undo_path);
+		dir = AllocateDir(undo_path);
+		if (dir == NULL)
+			continue;
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */
+			if (unlink(segment_path) < 0)
+				elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+
+		/*
+		 * We have no segment files.  Set the pointers to indicate that there
+		 * is no data.  The discard and insert pointers point to the first
+		 * usable byte in the segment we will create when we next try to
+		 * allocate.  This is a bit strange, because it means that they are
+		 * past the end pointer.  That's the same as when new undo logs are
+		 * created.
+		 *
+		 * TODO: Should we rewind to zero instead, so we can reuse that (now)
+		 * unreferenced address space?
+		 */
+		log->meta.insert = log->meta.discard = log->meta.end +
+			UndoLogBlockHeaderSize;
+	}
+}
+
+Datum
+pg_stat_get_undo_logs(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_UNDO_LOGS_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char *tablespace_name = NULL;
+	Oid last_tablespace = InvalidOid;
+	int			i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Scan all undo logs to build the results. */
+	for (i = 0; i < shared->array_size; ++i)
+	{
+		UndoLogControl *log = &shared->logs[i];
+		char buffer[17];
+		Datum values[PG_STAT_GET_UNDO_LOGS_COLS];
+		bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false };
+		Oid tablespace;
+
+		if (log == NULL)
+			continue;
+
+		/*
+		 * This won't be a consistent result overall, but the values for each
+		 * log will be consistent because we'll take the per-log lock while
+		 * copying them.
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+
+		/* Skip unused slots and entirely discarded undo logs. */
+		if (log->logno == InvalidUndoLogNumber ||
+			log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+		{
+			LWLockRelease(&log->mutex);
+			continue;
+		}
+
+		values[0] = ObjectIdGetDatum((Oid) log->logno);
+		values[1] = CStringGetTextDatum(
+			log->meta.persistence == UNDO_PERMANENT ? "permanent" :
+			log->meta.persistence == UNDO_UNLOGGED ? "unlogged" :
+			log->meta.persistence == UNDO_TEMP ? "temporary" : "<unknown>");
+		tablespace = log->meta.tablespace;
+
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.discard));
+		values[3] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.insert));
+		values[4] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.end));
+		values[5] = CStringGetTextDatum(buffer);
+		if (log->xid == InvalidTransactionId)
+			nulls[6] = true;
+		else
+			values[6] = TransactionIdGetDatum(log->xid);
+		if (log->pid == InvalidPid)
+			nulls[7] = true;
+		else
+			values[7] = Int32GetDatum((int32) log->pid);
+		if (log->meta.prevlogno == InvalidUndoLogNumber)
+			nulls[8] = true;
+		else
+			values[8] = ObjectIdGetDatum((Oid) log->meta.prevlogno);
+		switch (log->meta.status)
+		{
+		case UNDO_LOG_STATUS_ACTIVE:
+			values[9] = CStringGetTextDatum("ACTIVE"); break;
+		case UNDO_LOG_STATUS_FULL:
+			values[9] = CStringGetTextDatum("FULL"); break;
+		default:
+			nulls[9] = true;
+		}
+		LWLockRelease(&log->mutex);
+
+		/*
+		 * Deal with potentially slow tablespace name lookup without the lock.
+		 * Avoid making multiple calls to that expensive function for the
+		 * common case of repeating tablespace.
+		 */
+		if (tablespace != last_tablespace)
+		{
+			if (tablespace_name)
+				pfree(tablespace_name);
+			tablespace_name = get_tablespace_name(tablespace);
+			last_tablespace = tablespace;
+		}
+		if (tablespace_name)
+		{
+			values[2] = CStringGetTextDatum(tablespace_name);
+			nulls[2] = false;
+		}
+		else
+			nulls[2] = true;
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	if (tablespace_name)
+		pfree(tablespace_name);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * replay the creation of a new undo log
+ */
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+	xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	/* Create meta-data space in shared memory. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	/* TODO: assert that it doesn't exist already? */
+	log = allocate_undo_log();
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->logno = xlrec->logno;
+	log->meta.logno = xlrec->logno;
+	log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+	log->meta.persistence = xlrec->persistence;
+	log->meta.tablespace = xlrec->tablespace;
+	log->meta.insert = UndoLogBlockHeaderSize;
+	log->meta.discard = UndoLogBlockHeaderSize;
+	shared->next_logno = Max(xlrec->logno + 1, shared->next_logno);
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * replay the addition of a new segment to an undo log
+ */
+static void
+undolog_xlog_extend(XLogReaderState *record)
+{
+	xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record);
+
+	/* Extend exactly as we would during DO phase. */
+	extend_undo_log(xlrec->logno, xlrec->end);
+}
+
+/*
+ * replay the association of an xid with a specific undo log
+ */
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+	xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+	/* Restore current dbid */
+	MyUndoLogState.dbid = xlrec->dbid;
+
+	/*
+	 * Whatever follows is the first record for this transaction.  Zheap will
+	 * use this to add UREC_INFO_TRANSACTION.
+	 */
+	log = get_undo_log(xlrec->logno, false);
+	/* TODO */
+	log->meta.is_first_rec = true;
+	log->xid = xlrec->xid;
+}
+
+/*
+ * Drop all buffers for the given undo log, from old_discard up to
+ * new_discard.  If drop_tail is true, also drop the buffer that holds
+ * new_discard; this is used when discarding undo logs completely, for example
+ * via DROP TABLESPACE.  If it is false, then the final buffer is not dropped
+ * because it may contain data.
+ *
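+ * For example, assuming BLCKSZ is 8192: with old_discard = 16384 and
+ * new_discard = 32768, blocks 2 and 3 are forgotten; with drop_tail, block 4
+ * (which holds new_discard) is forgotten as well.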
+ */
+static void
+forget_undo_buffers(int logno, UndoLogOffset old_discard,
+					UndoLogOffset new_discard, bool drop_tail)
+{
+	BlockNumber old_blockno;
+	BlockNumber new_blockno;
+	RelFileNode	rnode;
+
+	UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard));
+	old_blockno = old_discard / BLCKSZ;
+	new_blockno = new_discard / BLCKSZ;
+	if (drop_tail)
+		++new_blockno;
+	while (old_blockno < new_blockno)
+		ForgetBuffer(rnode, UndoLogForkNum, old_blockno++);
+}
+
+/*
+ * replay an undo segment discard record
+ */
+static void
+undolog_xlog_discard(XLogReaderState *record)
+{
+	xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	UndoLogOffset old_segment_begin;
+	UndoLogOffset new_segment_begin;
+	RelFileNode rnode = {0};
+	char	dir[MAXPGPATH];
+
+	log = get_undo_log(xlrec->logno, false);
+	if (log == NULL)
+		elog(ERROR, "unknown undo log %d", xlrec->logno);
+
+	/*
+	 * We're about to discard undo data.  In Hot Standby mode, ensure that
+	 * there are no queries running which need to fetch tuples from the undo
+	 * that is being discarded.
+	 *
+	 * XXX we are passing an empty rnode to the conflict function so that it
+	 * checks for conflicts in all backends regardless of which database each
+	 * backend is connected to.
+	 */
+	if (InHotStandby && TransactionIdIsValid(xlrec->latestxid))
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode);
+
+	/*
+	 * See if we need to unlink or rename any files, but don't consider it an
+	 * error if we find that files are missing.  Since UndoLogDiscard()
+	 * performs filesystem operations before WAL logging or updating shmem
+	 * which could be checkpointed, a crash could have left files already
+	 * deleted, but we could replay WAL that expects the files to be there.
+	 */
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno == xlrec->logno);
+	discard = log->meta.discard;
+	end = log->meta.end;
+	LWLockRelease(&log->mutex);
+
+	/* Drop buffers before we remove/recycle any files. */
+	forget_undo_buffers(xlrec->logno, discard, xlrec->discard,
+						xlrec->entirely_discarded);
+
+	/* Rewind to the start of the segment. */
+	old_segment_begin = discard - discard % UndoLogSegmentSize;
+	new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize;
+
+	/* Unlink or rename segments that are no longer in range. */
+	while (old_segment_begin < new_segment_begin)
+	{
+		char	discard_path[MAXPGPATH];
+
+		/*
+		 * Before removing the file, make sure that undofile_sync knows that
+		 * it might be missing.
+		 */
+		undofile_forgetsync(log->logno,
+							log->meta.tablespace,
+							old_segment_begin / UndoLogSegmentSize);
+
+		UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize,
+						   log->meta.tablespace, discard_path);
+
+		/* Can we recycle the oldest segment? */
+		if (end < xlrec->end)
+		{
+			char	recycle_path[MAXPGPATH];
+
+			UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize,
+							   log->meta.tablespace, recycle_path);
+			if (rename(discard_path, recycle_path) == 0)
+			{
+				elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+				end += UndoLogSegmentSize;
+			}
+			else
+			{
+				elog(LOG, "could not rename \"%s\" to \"%s\": %m",
+					 discard_path, recycle_path);
+			}
+		}
+		else
+		{
+			if (unlink(discard_path) == 0)
+				elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+			else
+				elog(LOG, "could not unlink \"%s\": %m", discard_path);
+		}
+		old_segment_begin += UndoLogSegmentSize;
+	}
+
+	/* Create any further new segments that are needed the slow way. */
+	while (end < xlrec->end)
+	{
+		allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	/* Flush the directory entries. */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/* Update shmem. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = xlrec->discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (xlrec->entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * replay the rewind of an undo log
+ */
+static void
+undolog_xlog_rewind(XLogReaderState *record)
+{
+	xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	log = get_undo_log(xlrec->logno, false);
+	log->meta.insert = xlrec->insert;
+	log->meta.prevlen = xlrec->prevlen;
+}
+
+void
+undolog_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			undolog_xlog_create(record);
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			undolog_xlog_extend(record);
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			undolog_xlog_attach(record);
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			undolog_xlog_discard(record);
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			undolog_xlog_rewind(record);
+			break;
+		default:
+			elog(PANIC, "undo_redo: unknown op code %u", info);
+	}
+}
+
+/*
+ * For assertions only.
+ */
+bool
+AmAttachedToUndoLog(UndoLogControl *log)
+{
+	/*
+	 * In general, we can't access log's members without locking.  But this
+	 * function is intended only for asserting that you are attached, and
+	 * while you're attached the slot can't be recycled, so don't bother
+	 * locking.
+	 */
+	return MyUndoLogState.logs[log->meta.persistence] == log;
+}
+
+/*
+ * For testing use only.  This function is only used by the test_undo module.
+ */
+void
+UndoLogDetachFull(void)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+		if (MyUndoLogState.logs[i])
+			detach_current_undo_log(i, true);
+}
+
+/*
+ * Fetch database id from the undo log state
+ */
+Oid
+UndoLogStateGetDatabaseId(void)
+{
+	Assert(InRecovery);
+	return MyUndoLogState.dbid;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 53ddc59..17cbc8e 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -939,6 +939,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio
     ON pg_subscription TO public;
 
 
+CREATE VIEW pg_stat_undo_logs AS
+    SELECT *
+    FROM pg_stat_get_undo_logs();
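+
+-- Example usage: SELECT * FROM pg_stat_undo_logs;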
+
 --
 -- We have a few function definitions in here, too.
 -- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index f7e9160..b9daba4 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -54,6 +54,7 @@
 #include "access/reloptions.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -483,6 +484,20 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
 
 	/*
+	 * Drop the undo logs in this tablespace.  This will fail (without
+	 * dropping anything) if there are undo logs that we can't afford to drop
+	 * because they contain non-discarded data or a transaction is in
+	 * progress.  Since we hold TablespaceCreateLock, no other session will be
+	 * able to attach to an undo log in this tablespace (or any tablespace
+	 * except default) concurrently.
+	 */
+	if (!DropUndoLogsInTablespace(tablespaceoid))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
+						tablespacename)));
+
+	/*
 	 * Try to remove the physical infrastructure.
 	 */
 	if (!destroy_tablespace_directories(tablespaceoid, false))
@@ -1482,6 +1497,14 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		/* This shouldn't be able to fail in recovery. */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		if (!DropUndoLogsInTablespace(xlrec->ts_id))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("tablespace cannot be dropped because it contains non-empty undo logs")));
+		LWLockRelease(TablespaceCreateLock);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index afb4972..f60ecc5 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -154,6 +154,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_UNDOLOG_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a58..4725cbe 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, UndoLogShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	UndoLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81..b6c0b00 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,8 @@ RegisterLWLockTranches(void)
 	LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+	LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log");
+	LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard");
 
 	/* Register named tranches. */
 	for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..554af46 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+UndoLogLock							46
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 4f1d2a0..a3fc997 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -556,6 +556,7 @@ BaseInit(void)
 	InitFileAccess();
 	smgrinit();
 	InitBufferPoolAccess();
+	UndoLogInit();
 }
 
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index e471d7f..287ca00 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -118,6 +118,7 @@ extern int	CommitDelay;
 extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
+extern char *undo_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
 
@@ -3350,6 +3351,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Sets the tablespace(s) to use for undo logs."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE
+		},
+		&undo_tablespaces,
+		"",
+		check_undo_tablespaces, assign_undo_tablespaces, NULL
+	},
+
+	{
 		{"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER,
 			gettext_noop("Sets the path for dynamically loadable modules."),
 			gettext_noop("If a dynamically loadable module needs to be opened and "
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index ab5cb7f..a64d936 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -209,11 +209,13 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_undo",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
+	"base/undo",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..938150d 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
 #include "access/nbtxlog.h"
 #include "access/rmgr.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 0bbe9879..9c6fca4 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 0000000..10bd502
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,405 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ *
+ * PostgreSQL undo log manager.  This module is responsible for lifecycle
+ * management of undo logs and backing files, associating undo logs with
+ * backends, allocating and managing space within undo logs.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/xlogreader.h"
+#include "catalog/pg_class.h"
+#include "common/relpath.h"
+#include "storage/bufpage.h"
+
+#ifndef FRONTEND
+#include "storage/lwlock.h"
+#endif
+
+/* The type used to identify an undo log and position within it. */
+typedef uint64 UndoRecPtr;
+
+/* The type used for undo record lengths. */
+typedef uint16 UndoRecordSize;
+
+/* Undo log statuses. */
+typedef enum
+{
+	UNDO_LOG_STATUS_UNUSED = 0,
+	UNDO_LOG_STATUS_ACTIVE,
+	UNDO_LOG_STATUS_FULL,
+	UNDO_LOG_STATUS_DISCARDED
+} UndoLogStatus;
+
+/*
+ * Undo log persistence levels.  These have a one-to-one correspondence with
+ * relpersistence values, but are small integers so that we can use them as an
+ * index into the "logs" and "lognos" arrays.
+ */
+typedef enum
+{
+	UNDO_PERMANENT = 0,
+	UNDO_UNLOGGED = 1,
+	UNDO_TEMP = 2
+} UndoPersistence;
+
+#define UndoPersistenceLevels 3
+
+/*
+ * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence
+ * enumerator.
+ */
+#define UndoPersistenceForRelPersistence(rp)						\
+	((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT :			\
+	 (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP)
+
+/*
+ * Convert from UndoPersistence to a relpersistence value.
+ */
+#define RelPersistenceForUndoPersistence(up)				\
+	((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT :	\
+	 (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED :		\
+	 RELPERSISTENCE_TEMP)
+
+/*
+ * Get the appropriate UndoPersistence value from a Relation.
+ */
+#define UndoPersistenceForRelation(rel)									\
+	(UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence))
+
+/* Type for offsets within undo logs */
+typedef uint64 UndoLogOffset;
+
+/* printf-family format string for UndoRecPtr. */
+#define UndoRecPtrFormat "%016" INT64_MODIFIER "X"
+
+/* printf-family format string for UndoLogOffset. */
+#define UndoLogOffsetFormat UINT64_FORMAT
+
+/* Number of blocks of BLCKSZ in an undo log segment file.  128 = 1MB. */
+#define UNDOSEG_SIZE 128
+
+/* Size of an undo log segment file in bytes. */
+#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE)
+
+/* The width of an undo log number in bits.  24 allows for 16.7m logs. */
+#define UndoLogNumberBits 24
+
+/* The maximum valid undo log number. */
+#define MaxUndoLogNumber ((1 << UndoLogNumberBits) - 1)
+
+/* The width of an undo log offset in bits.  40 allows for 1TB per log. */
+#define UndoLogOffsetBits (64 - UndoLogNumberBits)
+
+/* Special value for undo record pointer which indicates that it is invalid. */
+#define	InvalidUndoRecPtr	((UndoRecPtr) 0)
+
+/* End-of-list value when building linked lists of undo logs. */
+#define InvalidUndoLogNumber -1
+
+/*
+ * This undo record pointer is used in the transaction header.  It is a
+ * special value indicating that we do not yet know the start point of the
+ * next transaction; it will be updated with a valid value in the future.
+ */
+#define SpecialUndoRecPtr	((UndoRecPtr) 0xFFFFFFFFFFFFFFFF)
+
+/*
+ * The maximum amount of data that can be stored in an undo log.  Can be set
+ * artificially low to test full log behavior.
+ */
+#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits)
+
+/* Type for numbering undo logs. */
+typedef int UndoLogNumber;
+
+/* Extract the undo log number from an UndoRecPtr. */
+#define UndoRecPtrGetLogNo(urp)					\
+	((urp) >> UndoLogOffsetBits)
+
+/* Extract the offset from an UndoRecPtr. */
+#define UndoRecPtrGetOffset(urp)				\
+	((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1))
+
+/* Make an UndoRecPtr from a log number and offset. */
+#define MakeUndoRecPtr(logno, offset)			\
+	(((uint64) (logno) << UndoLogOffsetBits) | (offset))
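+
+/*
+ * For example, with UndoLogNumberBits = 24 and UndoLogOffsetBits = 40,
+ * MakeUndoRecPtr(3, 0x1000) is 0x0000030000001000; UndoRecPtrGetLogNo()
+ * recovers 3 and UndoRecPtrGetOffset() recovers 0x1000.
+ */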
+
+/* The number of unusable bytes in the header of each block. */
+#define UndoLogBlockHeaderSize SizeOfPageHeaderData
+
+/* The number of usable bytes we can store per block. */
+#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize)
+
+/* The pseudo-database OID used for undo logs. */
+#define UndoLogDatabaseOid 9
+
+/* Length of undo checkpoint filename */
+#define UNDO_CHECKPOINT_FILENAME_LENGTH	16
+
+/*
+ * UndoRecPtrIsValid
+ *		True iff undoRecPtr is valid.
+ */
+#define UndoRecPtrIsValid(undoRecPtr) \
+	((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr))
+
+/* Extract the relnode for an undo log. */
+#define UndoRecPtrGetRelNode(urp)				\
+	UndoRecPtrGetLogNo(urp)
+
+/* The only valid fork number for undo log buffers. */
+#define UndoLogForkNum MAIN_FORKNUM
+
+/* Compute the block number that holds a given UndoRecPtr. */
+#define UndoRecPtrGetBlockNum(urp)				\
+	(UndoRecPtrGetOffset(urp) / BLCKSZ)
+
+/* Compute the offset of a given UndoRecPtr in the page that holds it. */
+#define UndoRecPtrGetPageOffset(urp)			\
+	(UndoRecPtrGetOffset(urp) % BLCKSZ)
+
+/* Compare two undo checkpoint files to find the oldest file. */
+#define UndoCheckPointFilenamePrecedes(file1, file2)	\
+	(strcmp(file1, file2) < 0)
+
+/* What is the offset of the i'th non-header byte? */
+#define UndoLogOffsetFromUsableByteNo(i)								\
+	(((i) / UndoLogUsableBytesPerPage) * BLCKSZ +						\
+	 UndoLogBlockHeaderSize +											\
+	 ((i) % UndoLogUsableBytesPerPage))
+
+/* How many non-header bytes are there before a given offset? */
+#define UndoLogOffsetToUsableByteNo(offset)				\
+	(((offset) % BLCKSZ - UndoLogBlockHeaderSize) +		\
+	 ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage)
+
+/* Add 'n' usable bytes to offset stepping over headers to find new offset. */
+#define UndoLogOffsetPlusUsableBytes(offset, n)							\
+	UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n))
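+
+/*
+ * For example, assuming BLCKSZ is 8192 and UndoLogBlockHeaderSize is 24,
+ * usable byte 0 lives at offset 24 (just after the first page header) and
+ * usable byte 8168 lives at offset 8216 (just after the second page header),
+ * so UndoLogOffsetPlusUsableBytes(24, 8168) evaluates to 8216.
+ */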
+
+/* Populate a RelFileNode from an UndoRecPtr. */
+#define UndoRecPtrAssignRelFileNode(rfn, urp)			\
+	do													\
+	{													\
+		(rfn).spcNode = UndoRecPtrGetTablespace(urp);	\
+		(rfn).dbNode = UndoLogDatabaseOid;				\
+		(rfn).relNode = UndoRecPtrGetRelNode(urp);		\
+	} while (false)
+
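+/*
+ * With this mapping, undo pages are addressed like ordinary buffered pages:
+ * spcNode is the undo log's tablespace, dbNode is the pseudo-database OID
+ * UndoLogDatabaseOid, relNode is the undo log number, the fork is
+ * UndoLogForkNum, and the block number is UndoRecPtrGetBlockNum(urp).
+ */
+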
+/*
+ * Control metadata for an active undo log.  Lives in shared memory inside an
+ * UndoLogControl object, but also written to disk during checkpoints.
+ */
+typedef struct UndoLogMetaData
+{
+	UndoLogNumber logno;
+	UndoLogStatus status;
+	Oid		tablespace;
+	UndoPersistence persistence;	/* permanent, unlogged, temp? */
+	UndoLogOffset insert;			/* next insertion point (head) */
+	UndoLogOffset end;				/* one past end of highest segment */
+	UndoLogOffset discard;			/* oldest data needed (tail) */
+	UndoLogOffset last_xact_start;	/* last transaction's start undo offset */
+
+	/*
+	 * If the same transaction is split over two undo logs then this stores
+	 * the previous log number; see the file header comments of undorecord.c
+	 * for its usage.
+	 *
+	 * Fixme: See if we can find another way to handle this instead of
+	 * keeping the previous log number.
+	 */
+	UndoLogNumber prevlogno;		/* Previous undo log number */
+	bool	is_first_rec;
+
+	/*
+	 * Length of the last undo record.  We save this in the undo meta-data and
+	 * WAL-log it so that the value is preserved across a restart and the
+	 * first undo record written after the restart can get it properly.  It is
+	 * used to step back to the previous record of the transaction during
+	 * rollback.  If a transaction wrote some undo before a checkpoint and the
+	 * rest after it, we could not roll back properly without the prevlen that
+	 * was recorded before the checkpoint.  The undo worker also fetches this
+	 * value when rolling back the last transaction in an undo log, to locate
+	 * that transaction's last undo record.
+	 */
+	uint16	prevlen;
+} UndoLogMetaData;
+
+#ifndef FRONTEND
+
+/*
+ * The in-memory control object for an undo log.  We have a fixed-sized array
+ * of these.
+ */
+typedef struct UndoLogControl
+{
+	/*
+	 * Protected by UndoLogLock and 'mutex'.  Both must be held to steal this
+	 * slot for another undolog.  Either may be held to prevent that from
+	 * happening.
+	 */
+	UndoLogNumber logno;			/* InvalidUndoLogNumber for unused slots */
+
+	/* Protected by UndoLogLock. */
+	UndoLogNumber next_free;		/* link for active unattached undo logs */
+
+	/* Protected by 'mutex'. */
+	LWLock	mutex;
+	UndoLogMetaData meta;			/* current meta-data */
+	XLogRecPtr      lsn;
+	bool	need_attach_wal_record;	/* do we need to write an attach WAL record? */
+	pid_t		pid;				/* InvalidPid for unattached */
+	TransactionId xid;
+
+	/* Protected by 'discard_lock'.  State used by undo workers. */
+	LWLock		discard_lock;		/* prevents discarding while reading */
+	TransactionId	oldest_xid;		/* cache of oldest transaction's xid */
+	uint32		oldest_xidepoch;
+	UndoRecPtr	oldest_data;
+
+} UndoLogControl;
+
+extern UndoLogControl *UndoLogGet(UndoLogNumber logno, bool missing_ok);
+extern UndoLogControl *UndoLogNext(UndoLogControl *log);
+extern bool AmAttachedToUndoLog(UndoLogControl *log);
+extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full);
+
+/*
+ * Each backend maintains a small hash table mapping undo log numbers to
+ * UndoLogControl objects in shared memory.
+ *
+ * We also cache the tablespace here, since we need fast access to that when
+ * resolving UndoRecPtr to a buffer tag.  We could also reach that via
+ * control->meta.tablespace, but that can't be accessed without locking (since
+ * the UndoLogControl object might be recycled).  Since the tablespace for a
+ * given undo log is constant for the whole life of the undo log, there is no
+ * invalidation problem to worry about.
+ */
+typedef struct UndoLogTableEntry
+{
+	UndoLogNumber	number;
+	UndoLogControl *control;
+	Oid				tablespace;
+	char			status;
+} UndoLogTableEntry;
+
+/*
+ * Instantiate fast inline hash table access functions.  We use an identity
+ * hash function for speed, since we already have integers and don't expect
+ * many collisions.
+ */
+#define SH_PREFIX undologtable
+#define SH_ELEMENT_TYPE UndoLogTableEntry
+#define SH_KEY_TYPE UndoLogNumber
+#define SH_KEY number
+#define SH_HASH_KEY(tb, key) (key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+extern PGDLLIMPORT undologtable_hash *undologtable_cache;
+
+/*
+ * Find the OID of the tablespace that holds a given UndoRecPtr.  This is
+ * included in the header so it can be inlined by UndoRecPtrAssignRelFileNode.
+ */
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+	UndoLogNumber		logno = UndoRecPtrGetLogNo(urp);
+	UndoLogTableEntry  *entry;
+
+	/*
+	 * Fast path, for undo logs we've seen before.  This is safe because
+	 * tablespaces are constant for the lifetime of an undo log number.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		return entry->tablespace;
+
+	/*
+	 * Slow path: force cache entry to be created.  Raises an error if the
+	 * undo log has been entirely discarded, or hasn't been created yet.  That
+	 * is appropriate here, because this interface is designed for accessing
+	 * undo pages via bufmgr, and we should never be trying to access undo
+	 * pages that have been discarded.
+	 */
+	UndoLogGet(logno, false);
+
+	/*
+	 * We use the value from the newly created cache entry, because it's
+	 * cheaper than acquiring log->mutex and reading log->meta.tablespace.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	return entry->tablespace;
+}
+#endif
+
+/* Space management. */
+extern UndoRecPtr UndoLogAllocate(size_t size,
+								  UndoPersistence level);
+extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid,
+											size_t size,
+											UndoPersistence persistence);
+extern void UndoLogAdvance(UndoRecPtr insertion_point,
+						   size_t size,
+						   UndoPersistence persistence);
+extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid);
+extern bool UndoLogIsDiscarded(UndoRecPtr point);
+
+/* Initialization interfaces. */
+extern void StartupUndoLogs(XLogRecPtr checkPointRedo);
+extern void UndoLogShmemInit(void);
+extern Size UndoLogShmemSize(void);
+extern void UndoLogInit(void);
+extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace,
+							   char *path);
+extern void ResetUndoLogs(UndoPersistence persistence);
+
+/* Interface used by tablespace.c. */
+extern bool DropUndoLogsInTablespace(Oid tablespace);
+
+/* GUC interfaces. */
+extern void assign_undo_tablespaces(const char *newval, void *extra);
+
+/* Checkpointing interfaces. */
+extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo,
+							   XLogRecPtr priorCheckPointRedo);
+
+extern void UndoLogSetLastXactStartPoint(UndoRecPtr point);
+extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno);
+extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno,
+										  TransactionId xid);
+extern UndoRecPtr UndoLogGetLastRecordPtr(UndoLogNumber,
+										  TransactionId xid);
+extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen);
+extern bool IsTransactionFirstRec(TransactionId xid);
+extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen);
+extern uint16 UndoLogGetPrevLen(UndoLogNumber logno);
+extern void UndoLogSetLSN(XLogRecPtr lsn);
+extern void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);
+/* Discard the undo logs for temp tables */
+extern void TempUndoDiscard(UndoLogNumber);
+extern Oid UndoLogStateGetDatabaseId(void);
+
+/* Test-only interfaces. */
+extern void UndoLogDetachFull(void);
+
+#endif
diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h
new file mode 100644
index 0000000..fe88ac5
--- /dev/null
+++ b/src/include/access/undolog_xlog.h
@@ -0,0 +1,72 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog_xlog.h
+ *	  undo log access XLOG definitions.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_XLOG_H
+#define UNDOLOG_XLOG_H
+
+#include "access/undolog.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* XLOG records */
+#define XLOG_UNDOLOG_CREATE		0x00
+#define XLOG_UNDOLOG_EXTEND		0x10
+#define XLOG_UNDOLOG_ATTACH		0x20
+#define XLOG_UNDOLOG_DISCARD	0x30
+#define XLOG_UNDOLOG_REWIND		0x40
+#define XLOG_UNDOLOG_META		0x50
+
+/* Create a new undo log. */
+typedef struct xl_undolog_create
+{
+	UndoLogNumber logno;
+	Oid		tablespace;
+	UndoPersistence persistence;
+} xl_undolog_create;
+
+/* Extend an undo log by adding a new segment. */
+typedef struct xl_undolog_extend
+{
+	UndoLogNumber logno;
+	UndoLogOffset end;
+} xl_undolog_extend;
+
+/* Record the undo log number used for a transaction. */
+typedef struct xl_undolog_attach
+{
+	TransactionId xid;
+	UndoLogNumber logno;
+	Oid				dbid;
+} xl_undolog_attach;
+
+/* Discard space, and possibly destroy or recycle undo log segments. */
+typedef struct xl_undolog_discard
+{
+	UndoLogNumber logno;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	TransactionId latestxid;	/* latest xid whose undo records are discarded */
+	bool		  entirely_discarded;
+} xl_undolog_discard;
+
+/* Rewind insert location of the undo log. */
+typedef struct xl_undolog_rewind
+{
+	UndoLogNumber logno;
+	UndoLogOffset insert;
+	uint16		  prevlen;
+} xl_undolog_rewind;
+
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 4026018..b4c3ad9 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10038,4 +10038,11 @@
   proargnames => '{rootrelid,relid,parentrelid,isleaf,level}',
   prosrc => 'pg_partition_tree' }
 
+# undo logs
+{ oid => '5032', descr => 'list undo logs',
+  proname => 'pg_stat_get_undo_logs', procost => '1', prorows => '10', proretset => 't',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,text,text,text,text,text,xid,int4,oid,text}', proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{logno,persistence,tablespace,discard,insert,end,xid,pid,prev_logno,status}', prosrc => 'pg_stat_get_undo_logs' },
+
 ]
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73..4305af6 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,8 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SHARED_TUPLESTORE,
 	LWTRANCHE_TBM,
 	LWTRANCHE_PARALLEL_APPEND,
+	LWTRANCHE_UNDOLOG,
+	LWTRANCHE_UNDODISCARD,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index f462eab..217d80a 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -426,6 +426,8 @@ extern void GUC_check_errcode(int sqlerrcode);
 extern bool check_default_tablespace(char **newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra, GucSource source);
 extern void assign_temp_tablespaces(const char *newval, void *extra);
+extern bool check_undo_tablespaces(char **newval, void **extra, GucSource source);
+extern void assign_undo_tablespaces(const char *newval, void *extra);
 
 /* in catalog/namespace.c */
 extern bool check_search_path(char **newval, void **extra, GucSource source);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 735dd37..f3de192 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1918,6 +1918,17 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_undo_logs| SELECT pg_stat_get_undo_logs.logno,
+    pg_stat_get_undo_logs.persistence,
+    pg_stat_get_undo_logs.tablespace,
+    pg_stat_get_undo_logs.discard,
+    pg_stat_get_undo_logs.insert,
+    pg_stat_get_undo_logs."end",
+    pg_stat_get_undo_logs.xid,
+    pg_stat_get_undo_logs.pid,
+    pg_stat_get_undo_logs.prev_logno,
+    pg_stat_get_undo_logs.status
+   FROM pg_stat_get_undo_logs() pg_stat_get_undo_logs(logno, persistence, tablespace, discard, insert, "end", xid, pid, prev_logno, status);
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
-- 
1.8.3.1

0003-undo-interface-v3.patchapplication/x-patch; name=0003-undo-interface-v3.patchDownload
From a14515a3e68cec81cfb59d727ef9822e42b20248 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 5 Nov 2018 00:54:17 -0800
Subject: [PATCH 3/4] undo-interface-v3

Provide an interface to prepare, insert, and fetch undo records.  This
layer uses the undo log storage layer to reserve space for the undo
records and the buffer management routines to write and read them.

Dilip Kumar, with help from Rafia Sabia.  Based on an early prototype
for forming undo records by Robert Haas and design input from Amit Kapila.
---
 src/backend/access/transam/xact.c    |   24 +
 src/backend/access/transam/xlog.c    |   29 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1152 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  459 ++++++++++++++
 src/include/access/undoinsert.h      |  101 +++
 src/include/access/undorecord.h      |  216 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 1985 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..6b7f7fa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -189,6 +189,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +916,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for the first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..23f23e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
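+ *
+ * For example, if the last checkpoint recorded ckptXid = 4000000000 in epoch
+ * 5 and the caller passes xid = 100, then xid is numerically smaller but
+ * TransactionIdFollows(xid, ckptXid) holds (the xid space has wrapped), so
+ * the result is epoch 6.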
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side of ckptXid when near wrap-around.  If xid is
+	 * numerically less than ckptXid but logically follows it, it must have
+	 * wrapped into the next epoch.  OTOH, if it is numerically greater but
+	 * logically precedes ckptXid, then it belongs to the previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..2453cad
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1152 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Handling multiple logs -
+ *  It is possible for the undo records of a single transaction to be spread
+ *  across multiple undo logs, and we need some special handling while
+ *  inserting undo so that discard and rollback work sanely.
+ *
+ *  If an undo record goes to the next log then we insert a transaction
+ *  header for the first record in the new log and update the transaction
+ *  header with the new log's location.  This allows us to connect a
+ *  transaction across logs when it spans more than one log (for this we keep
+ *  track of the previous logno in the undo log meta-data), which is required
+ *  to find the latest undo record pointer of an aborted transaction so that
+ *  its undo actions can be executed before discard.  If the next log gets
+ *  processed first, we don't need to trace back to the actual start pointer
+ *  of the transaction; in that case we can execute the undo actions from the
+ *  current log only, because the undo pointer in the slot will be rewound
+ *  and that is enough to avoid executing the same actions again.  However,
+ *  it is possible that after the undo actions have been executed the undo
+ *  pointer gets discarded; later, while processing the previous log, we
+ *  might try to fetch an undo record from the discarded log while chasing
+ *  the transaction header chain.  To avoid this we first check whether the
+ *  transaction's next_urec has already been discarded; if so, there is no
+ *  need to access it and we start executing from the last undo record in the
+ *  current log.
+ *
+ *  We only connect to the next log if the same transaction spreads into it;
+ *  otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * FIXME: Do we want to support an undo tuple size larger than BLCKSZ?  If
+ * not, then an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * Also consider the buffers needed for updating the previous transaction's
+ * starting undo record; hence the count is increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       ((MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO)
+
+/* Maximum number of undo records that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2
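+
+/*
+ * With the defaults here (MAX_PREPARED_UNDO = 2, MAX_BUFFER_PER_UNDO = 2),
+ * MAX_UNDO_BUFFERS works out to (2 + 1) * 2 = 6 buffer slots.
+ */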
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to prepare more than the default maximum number of undo
+ * records, the limit can be increased by calling UndoSetPrepareSize().  In
+ * that case dynamic memory is allocated, and prepared_undo and undo_buffer
+ * point to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers(); the variables are then set back to their
+ * default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record.*/
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool IsPrevTxnUndoDiscarded(UndoLogControl *log,
+								   UndoRecPtr prev_xact_urp);
+
+/*
+ * Check whether the previous transaction's undo is already discarded.
+ *
+ * The caller must call this while holding log->discard_lock.
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is not yet initialized.  We have to check
+		 * UndoLogIsDiscarded and if it's already discarded then we have
+		 * nothing to do.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return true;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return true;
+	}
+
+	return false;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo inside the
+ * critical section.
+ */
+static void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * TODO: For now we don't know how to build a transaction chain for
+	 * temporary undo logs.  That's because this log might have been used by a
+	 * different backend, and we can't access its buffers.  What should happen
+	 * is that the undo data should be automatically discarded when the other
+	 * backend detaches, but that code doesn't exist yet and the undo worker
+	 * can't do it either.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * prev_xact_urp is fetched from the undo log meta-data above.  If it is
+	 * invalid, this is the first undo record for this log and we have
+	 * nothing to update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (IsPrevTxnUndoDiscarded(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * PrepareUndoRecordUpdateTransInfo.  This must be called under the critical
+ * section.  This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (IsPrevTxnUndoDiscarded(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present, just
+ * return its index; otherwise read the buffer, add an entry for it to the
+ * array, and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because
+		 * undo_buffer might hold undo from different undo logs (e.g. when
+		 * the previous transaction's start header is in the previous undo
+		 * log), so compare (logno, blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert it into the
+	 * undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords undo records and allocate the
+ * space in bulk.  This is required for operations that can write multiple
+ * undo records in one WAL operation, e.g. multi-insert.  If we don't allocate
+ * undo space for all the records (which are inserted under one WAL record)
+ * together, some of them might end up in a different undo log, and currently
+ * during recovery we have no mechanism to map an xid to multiple log numbers
+ * for one WAL operation.  So, in short, all the undo written under one WAL
+ * record must be allocated from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set uur_next
+	 * to SpecialUndoRecPtr.  This is the indication to allocate space for the
+	 * transaction header; the real value of uur_next will be filled in while
+	 * preparing the first undo record of the next transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, fetch the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables here, as they are
+			 * used only with the first record of a transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log but not the first record of
+	 * the transaction, i.e. the same transaction continued from the
+	 * previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If the transaction id has changed then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			PrepareUndoRecordUpdateTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the requested size is greater than
+ * MAX_PREPARED_UNDO, extra memory is allocated to hold the extra prepared
+ * undo records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Allow for the buffers needed to update the previous transaction's
+	 * starting undo record; hence the limit is increased by 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
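+
+/*
+ * For illustration, a hypothetical caller that writes four undo records under
+ * a single WAL record (say, a multi-insert) would size the arrays and allocate
+ * all of the undo space up front, and only then prepare each record:
+ *
+ *    UndoSetPrepareSize(4, undorecords, InvalidTransactionId, persistence);
+ *    for (i = 0; i < 4; i++)
+ *        urecptr[i] = PrepareUndoInsert(&undorecords[i], persistence,
+ *                                       InvalidTransactionId);
+ *
+ * so that all four records are guaranteed to be allocated from the same undo
+ * log.  Here 'undorecords', 'urecptr' and 'persistence' are the caller's own
+ * variables; the record contents are set up by the caller beforehand.
+ */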
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * the undo log only stores the mapping for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX Another option would be that, instead of adding the information to
+	 * this record, we prepare a new record which only contains the
+	 * transaction information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery we always expect to have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores the
+		 * mapping for top-level transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report an error? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* If the undo record does not fit into this block, go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in the page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepare_undo limit was changed, free the allocated memory and
+	 * reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * to by urp and unpack it into urec.  This function will not release the pin
+ * on the buffer if the complete record is fetched from one buffer, so the
+ * caller can reuse the same urec to fetch another undo record which is on the
+ * same block.  The caller is responsible for releasing the buffer inside urec
+ * and setting it to invalid if it wishes to fetch a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch the header first and only
+		 * fetch the complete record if the block number and offset match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_split = true;
+
+		/*
+		 * The complete record doesn't fit into one buffer, so release the
+		 * buffer pin and also set the buffer in the undo record to invalid.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for the given blkno, offset and transaction id
+ * (if valid).  We need to match the transaction id along with the block
+ * number and offset because in some cases (like reuse of a slot for a
+ * committed transaction), we need to skip the record if it was modified by a
+ * transaction later than the transaction indicated by the previous undo
+ * record.  For example, consider a case where tuple (ctid - 0,1) is modified
+ * by transaction id 500 which belongs to transaction slot 0.  Then, the same
+ * tuple is modified by transaction id 501 which belongs to transaction slot
+ * 1.  Then, both transaction slots are marked for reuse.  Then, again the
+ * same tuple is modified by transaction id 502 which has used slot 0.  Now,
+ * some transaction which started before transaction 500 and wants to
+ * traverse the chain to find a visible tuple will keep rotating infinitely
+ * between the undo tuples written by 502 and 501.  In such a case, we need
+ * to skip the undo tuple written by transaction 502 when we want to find the
+ * undo record indicated by the previous pointer of the undo tuple written by
+ * transaction 501.  Start the search from urp.  The caller needs to call
+ * UndoRecordRelease to release the resources allocated by this function.
+ *
+ * urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record, if a valid pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned then just check whether the next
+		 * record we want is in the same block.  Otherwise release the buffer
+		 * and set it to invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * UndoLogIsDiscarded and if it's already discarded then we have
+			 * nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
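+
+/*
+ * A minimal sketch of a SatisfyUndoRecordCallback, purely for illustration
+ * (the real callbacks are supplied by the access method built on top of this
+ * layer):
+ *
+ *    static bool
+ *    sample_rec_matches(UnpackedUndoRecord *urec, BlockNumber blkno,
+ *                       OffsetNumber offset, TransactionId xid)
+ *    {
+ *        return urec->uur_block == blkno &&
+ *               urec->uur_offset == offset &&
+ *               (!TransactionIdIsValid(xid) ||
+ *                TransactionIdEquals(urec->uur_xid, xid));
+ *    }
+ *
+ * UndoFetchRecord() keeps following uur_blkprev until the callback accepts a
+ * record, and returns NULL if it reaches undo that has already been discarded.
+ */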
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr(logno, offset - prevlen);
+}
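+
+/*
+ * For example (with made-up numbers), if urp points at offset 8192 of log 7
+ * and prevlen is 60, the previous record starts at offset 8132 of the same
+ * log.  If urp instead points at the very first record of log 7 (offset equal
+ * to UndoLogBlockHeaderSize), we switch to log 7's prevlogno and subtract
+ * prevlen from that log's insert location instead.
+ */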
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..33bb153
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,459 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* FIXME: Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
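+
+/*
+ * Worked example with made-up sizes: suppose a record consists of a 20-byte
+ * header followed by a 12-byte block structure, and the first page had only
+ * 24 bytes of free space, so *total_bytes_written was 24 when we moved on to
+ * the next page.  On that page the header call starts with *my_bytes_written
+ * = 24 >= 20, so it merely subtracts (leaving 4) and writes nothing; the
+ * block-structure call then sees 4 bytes already written and copies the
+ * remaining 12 - 4 = 8 bytes, starting from offset 4 of the source structure.
+ */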
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_split = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point payload data and tuple data into the page otherwise allocate
+		 * the memory.
+		 *
+		 * XXX There is a possible optimization here: instead of always
+		 * allocating memory whenever the record is split, we could check
+		 * whether the payload or tuple data falls entirely within one page
+		 * and, if so, avoid allocating memory for it.
+		 */
+		if (!is_undo_split &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the length of
+ * the data to be read in bytes.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * 'nocopy': if this flag is set to true, we just skip over readlen bytes of
+ * undo data without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..2b73f9b
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,101 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+				    TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
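+
+/*
+ * Sketch of a typical call sequence, for illustration only (error handling,
+ * record setup and the caller's own WAL payload are omitted; 'rmid', 'info'
+ * and 'first_block_id' are whatever the caller uses for its WAL record):
+ *
+ *    urecptr = PrepareUndoInsert(&undorecord, persistence,
+ *                                InvalidTransactionId);
+ *    START_CRIT_SECTION();
+ *    InsertPreparedUndo();
+ *    XLogBeginInsert();
+ *    ... XLogRegisterData() for the caller's own WAL payload ...
+ *    RegisterUndoLogBuffers(first_block_id);
+ *    recptr = XLogInsert(rmid, info);
+ *    UndoLogBuffersSetLSN(recptr);
+ *    END_CRIT_SECTION();
+ *    UnlockReleaseUndoBuffers();
+ */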
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for the given blkno and offset.  Start the
+ * search from urp.  The caller needs to call UndoRecordRelease to release
+ * the resources allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the requested size is greater than
+ * MAX_PREPARED_UNDO, extra memory is allocated to hold the extra prepared
+ * undo records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..85642ad
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,216 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without any alignment
+ * padding bytes, and the undo record itself need not be aligned either, so
+ * care must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the order just given: UndoRecordRelationDetails first, then
+ * UndoRecordBlock, then UndoRecordTransaction, then UndoRecordPayload.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer into which the undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id*/
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In future
+	 * it could also be used to show how much of the undo has been applied so
+	 * far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1
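
As a concrete illustration of the InsertUndoRecord()/UnpackUndoRecord() calling
convention described in undorecord.h above, a minimal caller-side sketch could
look like the following; get_next_undo_page() is a hypothetical placeholder for
however the caller obtains the next locked undo page, and is not part of the
patch:

#include "postgres.h"
#include "access/undorecord.h"
#include "storage/bufpage.h"

extern Page get_next_undo_page(void);	/* hypothetical helper, not in the patch */

/*
 * Sketch only: write one undo record starting at 'starting_byte' on 'page',
 * continuing onto further pages until InsertUndoRecord() reports completion.
 */
static void
write_record_sketch(UnpackedUndoRecord *uur, Page page, int starting_byte)
{
	int		already_written = 0;

	while (!InsertUndoRecord(uur, page, starting_byte, &already_written, false))
	{
		/* The record continues on the next page, right after its header. */
		page = get_next_undo_page();
		starting_byte = sizeof(PageHeaderData);
	}
}

Reading a record back follows the same pattern, calling UnpackUndoRecord() with
*already_decoded until it returns true.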

0004-undo-interface-test-v3.patchapplication/x-patch; name=0004-undo-interface-test-v3.patchDownload
From 81f2fed5ced8ed74517836657d15a42dd21dd4fd Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Mon, 5 Nov 2018 02:46:43 -0800
Subject: [PATCH 2/2] undo-interface-test

---
 src/test/modules/Makefile                          |  1 +
 src/test/modules/test_undo_api/Makefile            | 21 ++++++
 .../test_undo_api/expected/test_undo_api.out       | 12 ++++
 .../modules/test_undo_api/sql/test_undo_api.sql    |  8 +++
 .../modules/test_undo_api/test_undo_api--1.0.sql   |  8 +++
 src/test/modules/test_undo_api/test_undo_api.c     | 84 ++++++++++++++++++++++
 .../modules/test_undo_api/test_undo_api.control    |  4 ++
 7 files changed, 138 insertions(+)
 create mode 100644 src/test/modules/test_undo_api/Makefile
 create mode 100644 src/test/modules/test_undo_api/expected/test_undo_api.out
 create mode 100644 src/test/modules/test_undo_api/sql/test_undo_api.sql
 create mode 100644 src/test/modules/test_undo_api/test_undo_api--1.0.sql
 create mode 100644 src/test/modules/test_undo_api/test_undo_api.c
 create mode 100644 src/test/modules/test_undo_api/test_undo_api.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 43323a6..e05fd00 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_undo \
+		  test_undo_api \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/test_undo_api/Makefile b/src/test/modules/test_undo_api/Makefile
new file mode 100644
index 0000000..deb3816
--- /dev/null
+++ b/src/test/modules/test_undo_api/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_undo/Makefile
+
+MODULE_big = test_undo_api
+OBJS = test_undo_api.o
+PGFILEDESC = "test_undo_api - a test module for the undo api layer"
+
+EXTENSION = test_undo_api
+DATA = test_undo_api--1.0.sql
+
+REGRESS = test_undo_api
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_undo_api
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_undo_api/expected/test_undo_api.out b/src/test/modules/test_undo_api/expected/test_undo_api.out
new file mode 100644
index 0000000..995b517
--- /dev/null
+++ b/src/test/modules/test_undo_api/expected/test_undo_api.out
@@ -0,0 +1,12 @@
+CREATE EXTENSION test_undo_api;
+--
+-- This test will insert the data in the undo using undo api and after that
+-- it will fetch the data and verify that whether we have got the same data
+-- back or not.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
+ test_undo_api 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_undo_api/sql/test_undo_api.sql b/src/test/modules/test_undo_api/sql/test_undo_api.sql
new file mode 100644
index 0000000..4fb40ff
--- /dev/null
+++ b/src/test/modules/test_undo_api/sql/test_undo_api.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION test_undo_api;
+
+--
+-- This test will insert the data in the undo using undo api and after that
+-- it will fetch the data and verify that whether we have got the same data
+-- back or not.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
diff --git a/src/test/modules/test_undo_api/test_undo_api--1.0.sql b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
new file mode 100644
index 0000000..3dd134b
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
@@ -0,0 +1,8 @@
+\echo Use "CREATE EXTENSION test_undo_api" to load this file. \quit
+
+CREATE FUNCTION test_undo_api(xid xid, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_undo_api/test_undo_api.c b/src/test/modules/test_undo_api/test_undo_api.c
new file mode 100644
index 0000000..6026582
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.c
@@ -0,0 +1,84 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_class.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_undo_api);
+
+static UndoPersistence
+undo_persistence_from_text(text *t)
+{
+	char *str = text_to_cstring(t);
+
+	if (strcmp(str, "permanent") == 0)
+		return UNDO_PERMANENT;
+	else if (strcmp(str, "temporary") == 0)
+		return UNDO_TEMP;
+	else if (strcmp(str, "unlogged") == 0)
+		return UNDO_UNLOGGED;
+	else
+		elog(ERROR, "unknown undo persistence level: %s", str);
+}
+
+/*
+ * Prepare and insert data in undo storage and fetch it back to verify.
+ */
+Datum
+test_undo_api(PG_FUNCTION_ARGS)
+{
+	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	char	*data = "test_data";
+	int		 len = strlen(data);
+	UnpackedUndoRecord	undorecord;
+	UnpackedUndoRecord *undorecord_out;
+	int	header_size = offsetof(UnpackedUndoRecord, uur_next) + sizeof(uint64);
+	UndoRecPtr	undo_ptr;
+
+	undorecord.uur_type = 0;
+	undorecord.uur_info = 0;
+	undorecord.uur_prevlen = 0;
+	undorecord.uur_prevxid = FrozenTransactionId;
+	undorecord.uur_xid = xid;
+	undorecord.uur_cid = 0;
+	undorecord.uur_tsid = 100;
+	undorecord.uur_fork = MAIN_FORKNUM;
+	undorecord.uur_blkprev = 0;
+	undorecord.uur_block = 1;
+	undorecord.uur_offset = 100;
+	initStringInfo(&undorecord.uur_tuple);
+	
+	appendBinaryStringInfo(&undorecord.uur_tuple,
+						   (char *) data,
+						   len);
+	undo_ptr = PrepareUndoInsert(&undorecord, persistence, xid, NULL);
+	InsertPreparedUndo();
+	UnlockReleaseUndoBuffers();
+	
+	undorecord_out = UndoFetchRecord(undo_ptr, InvalidBlockNumber,
+									 InvalidOffsetNumber,
+									 InvalidTransactionId, NULL,
+									 NULL);
+
+	if (strncmp((char *) &undorecord, (char *) undorecord_out, header_size) != 0)
+		elog(ERROR, "undo header did not match");
+	if (strncmp(undorecord_out->uur_tuple.data, data, len) != 0)
+		elog(ERROR, "undo data did not match");
+
+	UndoRecordRelease(undorecord_out);
+	pfree(undorecord.uur_tuple.data);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_undo_api/test_undo_api.control b/src/test/modules/test_undo_api/test_undo_api.control
new file mode 100644
index 0000000..09df344
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.control
@@ -0,0 +1,4 @@
+comment = 'test_undo_api'
+default_version = '1.0'
+module_pathname = '$libdir/test_undo_api'
+relocatable = true
-- 
1.8.3.1

#12Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#11)
Re: Undo logs

On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[review for undo record layer (0003-undo-interface-v3)]

I might sound like I'm repeating myself, but just to be clear: I was
involved in the design of this patch and have given a few high-level
inputs for it. I have used this interface in the zheap development, but
hadn't done any sort of detailed review, which I am doing now. I
encourage others to review this patch as well.

1.
 * NOTES:
+ * Handling multilog -
+ *  It is possible that the undo record of a transaction can be spread across
+ *  multiple undo log.  And, we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.

I think before describing how the undo record is spread across
multiple logs, you can explain how it is laid out when that is not the
case. You can also explain how undo record headers are linked. I am
not sure file header is the best place or it should be mentioned in
README, but I think for now we can use file header for this purpose
and later we can move it to README if required.

2.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */

+#define MAX_BUFFER_PER_UNDO 2

I think the right question here is: what is the possibility of an undo
record being greater than BLCKSZ? For zheap, as of today, we don't
have any such requirement, as the largest undo record is written for
update or multi_insert and in both cases we don't exceed the limit of
BLCKSZ. I guess some user other than zheap could have such a
requirement, and I don't think it is impossible to enhance this if we
ever need to.

If anybody else has an opinion here, please feel free to share it.

3.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

I think it is better to define MAX_PREPARED_UNDO before
MAX_UNDO_BUFFERS, as the first one is used in the definition of the
second.

4.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+   TransactionId xid)

This function locks the buffer as well, which is right as we need to do
that before the critical section, but the function header comments don't
indicate it. You can modify it as:
"Upon return, the necessary undo buffers are pinned and locked."

Note that similar modification is required in .h file as well.

5.
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)

Here, the comments are wrong as buffers are already locked in the
previous step. A similar change is required in .h file as well.

6.
+InsertPreparedUndo(void)
{
..
/*
+ * Try to insert the record into the current page. If it doesn't
+ * succeed then recall the routine with the next page.
+ */
+ if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+ {
+ undo_len += already_written;
+ MarkBufferDirty(buffer);
+ break;
+ }
+
+ MarkBufferDirty(buffer);
..
}

Here, you are writing into a shared buffer and marking it dirty; isn't
it a good idea to Assert that we are in a critical section?

7.
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

/record/records

I think this definition doesn't define the maximum number of undo
records that can be prepared as the caller can use UndoSetPrepareSize
to change it. I think you can modify the comment as below or
something on those lines:
"This defines the number of undo records that can be prepared before
calling insert by default. If you need to prepare more than
MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
first."

8.
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
..

Isn't it a good idea to have an Assert that we already have discard_lock?

9.
+ UnpackedUndoRecord uur; /* prev txn's first undo record.*/
+} PreviousTxnInfo;

Extra space at the end of the comment is required.

10.
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{

The name suggests that this function is doing something special for
the previous transaction, whereas it just checks whether the undo
corresponding to a particular undo location has been discarded. Isn't it
better if we name it UndoRecordExists or UndoRecordIsValid? Then
explain in the comments when you consider a particular record to exist.

Another point to note is that you are not releasing the lock in all
paths, so it is better to mention in the comments when it will be
released and when not.

11.
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
+ {
+ /*
+ * oldest_data is not yet initialized.  We have to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(prev_xact_urp))
+ return true;

The comment in the above code just restates the code in words.
I think here you should explain why we need to call UndoLogIsDiscarded
when oldest_data is not initialized, and/or the scenario in which
oldest_data will not be initialized.

12.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
..

I find this function name a bit awkward. How about UndoRecordPrepareTransInfo?

13.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
+ /*
+ * TODO: For now we don't know how to build a transaction chain for
+ * temporary undo logs.  That's because this log might have been used by a
+ * different backend, and we can't access its buffers.  What should happen
+ * is that the undo data should be automatically discarded when the other
+ * backend detaches, but that code doesn't exist yet and the undo worker
+ * can't do it either.
+ */
+ if (log->meta.persistence == UNDO_TEMP)
+ return;

Aren't we already dealing with this case in the other patch [1]?
Basically, I think we should discard it at commit time and/or when the
backend is detached.

14.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
/*
+ * If previous transaction's urp is not valid means this backend is
+ * preparing its first undo so fetch the information from the undo log
+ * if it's still invalid urp means this is the first undo record for this
+ * log and we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..

This comment is confusing; it appears to be saying the same thing twice.
You can rewrite it along the lines of:

"The absence of previous transaction's undo indicate that this backend
is preparing its first undo in which case we have nothing to update.".

15.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
/*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doen't remove the record while we are in process of
+ * reading it.
+ */

Typo doen't/doesn't.

I think you can use 'can't' instead of doesn't.

This is by no means a complete review; I just noticed a few things
while reading the patch.

[1]: /messages/by-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S=MXvKKw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
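
To make the ordering behind points 4-6 concrete, the intended call sequence
can be sketched roughly as follows (illustrative only, based on the interface
as it is used by the test module earlier in this thread; exact argument lists
differ slightly between the patch versions posted here):

UnpackedUndoRecord undorecord;
UndoRecPtr	urecptr;
TransactionId xid = GetTopTransactionId();

/* ... fill in undorecord ... */

/*
 * PrepareUndoInsert can fail (it allocates undo space and reads buffers),
 * so it must run before the critical section; on return the required undo
 * buffers are pinned and locked.
 */
urecptr = PrepareUndoInsert(&undorecord, UNDO_PERMANENT, xid);

START_CRIT_SECTION();
InsertPreparedUndo();		/* writes into the already-locked buffers */
/* ... register buffers and emit the corresponding WAL record here ... */
END_CRIT_SECTION();

UnlockReleaseUndoBuffers();	/* drop the pins and locks taken at prepare time */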

#13Amit Kapila
amit.kapila16@gmail.com
In reply to: Amit Kapila (#10)
Re: Undo logs

On Wed, Oct 17, 2018 at 3:27 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Oct 15, 2018 at 6:09 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Sun, Sep 2, 2018 at 12:19 AM Thomas Munro
<thomas.munro@enterprisedb.com> wrote:

I have also pushed a new WIP version of the lower level undo log
storage layer patch set to a public branch[1]. I'll leave the earlier
branch[2] there because the record-level patch posted by Dilip depends
on it for now.

Till now, I have mainly reviewed undo log allocation part. This is a
big patch and can take much more time to complete the review. I will
review the other parts of the patch later.

I think I see another issue with this patch. Basically, extend_undo_log
assumes that no one can update log->meta.end concurrently, but that is
not true: it can be updated by UndoLogDiscard, which can lead to an
assertion failure in extend_undo_log.

+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
{
..
/*
+ * Create all the segments needed to increase 'end' to the requested
+ * size.  This is quite expensive, so we will try to avoid it completely
+ * by renaming files into place in UndoLogDiscard instead.
+ */
+ end = log->meta.end;
+ while (end < new_end)
+ {
+ allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+ end += UndoLogSegmentSize;
+ }
..
+ Assert(end == new_end);
..
/*
+ * We didn't need to acquire the mutex to read 'end' above because only
+ * we write to it.  But we need the mutex to update it, because the
+ * checkpointer might read it concurrently.
+ *
+ * XXX It's possible for meta.end to be higher already during
+ * recovery, because of the timing of a checkpoint; in that case we did
+ * nothing above and we shouldn't update shmem here.  That interaction
+ * needs more analysis.
+ */
+ LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+ if (log->meta.end < end)
+ log->meta.end = end;
+ LWLockRelease(&log->mutex);
..
}

Assume that, before we read log->meta.end in the above code, the discard
process concurrently discards some undo and moves the end pointer to a
later location; then the above assertion will fail. And if the discard
happens just after we read log->meta.end and before we do the rest of the
work in this function, it will crash in recovery.

Can't we just remove this Assert?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
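
If the Assert is dropped as suggested, the tail of extend_undo_log could look
roughly like this (a sketch only, not a proposed patch; it simply tolerates
UndoLogDiscard having advanced meta.end concurrently by only ever moving the
shared value forward under the mutex):

end = log->meta.end;
while (end < new_end)
{
	allocate_empty_undo_segment(logno, log->meta.tablespace, end);
	end += UndoLogSegmentSize;
}

/*
 * No Assert(end == new_end): UndoLogDiscard may have renamed segments into
 * place and advanced meta.end while we were working, so 'end' can
 * legitimately differ from new_end here.
 */
LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
if (log->meta.end < end)
	log->meta.end = end;
LWLockRelease(&log->mutex);

Whether that is also enough for the recovery-timing case mentioned in the XXX
comment above is a separate question.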

#14Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#11)
Re: Undo logs

On Mon, Nov 5, 2018 at 5:09 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Sep 3, 2018 at 11:26 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Thomas has already posted the latest version of undo log patches on
'Cleaning up orphaned files using undo logs' thread[1]. So I have
rebased the undo-interface patch also. This patch also includes
latest defect fixes from the main zheap branch [2].

Hi Thomas,

The latest patch for undo log storage does not compile on HEAD; I think
it needs to be rebased due to your commit related to "pg_pread() and
pg_pwrite() for data files and WAL".

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#15Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#12)
2 attachment(s)
Re: Undo logs

On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[review for undo record layer (0003-undo-interface-v3)]

I might sound like I'm repeating myself, but just to be clear: I was
involved in the design of this patch and have given a few high-level
inputs for it. I have used this interface in the zheap development, but
hadn't done any sort of detailed review, which I am doing now. I
encourage others to review this patch as well.

Thanks for the review; please find my replies inline.

1.
* NOTES:
+ * Handling multilog -
+ *  It is possible that the undo record of a transaction can be spread across
+ *  multiple undo log.  And, we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.

I think before describing how the undo record is spread across
multiple logs, you can explain how it is laid out when that is not the
case. You can also explain how undo record headers are linked. I am
not sure file header is the best place or it should be mentioned in
README, but I think for now we can use file header for this purpose
and later we can move it to README if required.

Added in the header.

2.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */

+#define MAX_BUFFER_PER_UNDO 2

I think the right question here is: what is the possibility of an undo
record being greater than BLCKSZ? For zheap, as of today, we don't
have any such requirement, as the largest undo record is written for
update or multi_insert and in both cases we don't exceed the limit of
BLCKSZ. I guess some user other than zheap could have such a
requirement, and I don't think it is impossible to enhance this if we
ever need to.

If anybody else has an opinion here, please feel free to share it.

Should we remove this FIXME, or should we wait for other opinions? For
now, I have kept it as it is.

3.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

I think it is better to define MAX_PREPARED_UNDO before
MAX_UNDO_BUFFERS, as the first one is used in the definition of the
second.

Done

4.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+   TransactionId xid)

This function locks the buffer as well, which is right as we need to do
that before the critical section, but the function header comments don't
indicate it. You can modify it as:
"Upon return, the necessary undo buffers are pinned and locked."

Note that similar modification is required in .h file as well.

Done

5.
+/*
+ * Insert a previously-prepared undo record.  This will lock the buffers
+ * pinned in the previous step, write the actual undo record into them,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)

Here, the comments are wrong as buffers are already locked in the
previous step. A similar change is required in .h file as well.

Fixed

6.
+InsertPreparedUndo(void)
{
..
/*
+ * Try to insert the record into the current page. If it doesn't
+ * succeed then recall the routine with the next page.
+ */
+ if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+ {
+ undo_len += already_written;
+ MarkBufferDirty(buffer);
+ break;
+ }
+
+ MarkBufferDirty(buffer);
..
}

Here, you are writing into a shared buffer and marking it dirty; isn't
it a good idea to Assert that we are in a critical section?

Done

7.
+/* Maximum number of undo record that can be prepared before calling insert. */
+#define MAX_PREPARED_UNDO 2

/record/records

I think this definition doesn't define the maximum number of undo
records that can be prepared as the caller can use UndoSetPrepareSize
to change it. I think you can modify the comment as below or
something on those lines:
"This defines the number of undo records that can be prepared before
calling insert by default. If you need to prepare more than
MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
first."

Fixed

8.
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
..

Isn't it a good idea to have an Assert that we already have discard_lock?

Done

9.
+ UnpackedUndoRecord uur; /* prev txn's first undo record.*/
+} PreviousTxnInfo;

Extra space at the end of the comment is required.

Done

10.
+/*
+ * Check if previous transactions undo is already discarded.
+ *
+ * Caller should call this under log->discard_lock
+ */
+static bool
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{

The name suggests that this function is doing something special for
the previous transaction, whereas it just checks whether the undo
corresponding to a particular undo location has been discarded. Isn't it
better if we name it UndoRecordExists or UndoRecordIsValid? Then
explain in the comments when you consider a particular record to exist.

Changed to UndoRecordIsValid

Another point to note is that you are not releasing the lock in all
paths, so it is better to mention in the comments when it will be
released and when not.

Done

11.
+IsPrevTxnUndoDiscarded(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+ if (log->oldest_data == InvalidUndoRecPtr)
+ {
+ /*
+ * oldest_data is not yet initialized.  We have to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(prev_xact_urp))
+ return true;

The comment in the above code just restates the code in words.
I think here you should explain why we need to call UndoLogIsDiscarded
when oldest_data is not initialized, and/or the scenario in which
oldest_data will not be initialized.

Fixed

12.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+void
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
..

I find this function name a bit awkward. How about UndoRecordPrepareTransInfo?

Changed as per the suggestion

13.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
+ /*
+ * TODO: For now we don't know how to build a transaction chain for
+ * temporary undo logs.  That's because this log might have been used by a
+ * different backend, and we can't access its buffers.  What should happen
+ * is that the undo data should be automatically discarded when the other
+ * backend detaches, but that code doesn't exist yet and the undo worker
+ * can't do it either.
+ */
+ if (log->meta.persistence == UNDO_TEMP)
+ return;

Aren't we already dealing with this case in the other patch [1]?
Basically, I think we should discard it at commit time and/or when the
backend is detached.

Changed

14.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
{
..
/*
+ * If previous transaction's urp is not valid means this backend is
+ * preparing its first undo so fetch the information from the undo log
+ * if it's still invalid urp means this is the first undo record for this
+ * log and we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..

This comment is confusing; it appears to be saying the same thing twice.
You can rewrite it along the lines of:

"The absence of previous transaction's undo indicate that this backend
is preparing its first undo in which case we have nothing to update.".

Done as per the suggestion

15.
+PrepareUndoRecordUpdateTransInfo(UndoRecPtr urecptr, bool log_switched)
/*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doen't remove the record while we are in process of
+ * reading it.
+ */

Typo doen't/doesn't.

I think you can use 'can't' instead of doesn't.

Fixed

This is by no means a complete review; I just noticed a few things
while reading the patch.

[1] - /messages/by-id/CAFiTN-t8fv-qYG9zynhS-1jRrvt_o5C-wCMRtzOsK8S=MXvKKw@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v4.patchapplication/octet-stream; name=0003-undo-interface-v4.patchDownload
From 8f5653c3c39bc7bea3f685b16ee386a036991109 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 00:09:04 -0800
Subject: [PATCH] Provide an interface for prepare, insert, or fetch the undo
 records. This layer will use undo-log-storage to reserve the space for the
 undo records and buffer management routine to write and read the undo
 records.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
---
 src/backend/access/transam/xact.c    |   24 +
 src/backend/access/transam/xlog.c    |   29 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1172 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  459 +++++++++++++
 src/include/access/undoinsert.h      |  106 +++
 src/include/access/undorecord.h      |  216 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2010 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..6b7f7fa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -189,6 +189,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +916,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..23f23e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Xid can be on either side of ckptXid when near wrap-around.  If xid
+	 * is numerically less than ckptXid but logically follows it, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically greater
+	 * but logically precedes it, it belongs to the previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..4214771
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1172 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored in sequential order in the undo log.  Each
+ *  transaction's first undo record (a.k.a. the transaction header) points to
+ *  the next transaction's transaction header.  Transaction headers are linked
+ *  so that the discard worker can read the undo log transaction by
+ *  transaction and avoid reading each undo record.
+ *
+ * Handling multi log:
+ *
+ *  A transaction's undo records can be spread across multiple undo logs, and
+ *  we need some special handling while inserting the undo so that discard and
+ *  rollback work sanely.
+ *
+ *  If an undo record goes into the next log, we insert a transaction header
+ *  for the first record in the new log and update the previous transaction
+ *  header with the new log's location.  This lets us connect transactions
+ *  across logs when a single transaction spans logs (for this we keep track
+ *  of the previous logno in the undo log metadata), which is required to find
+ *  the latest undo record pointer of an aborted transaction when executing
+ *  its undo actions before discard.  If the next log gets processed first, we
+ *  don't need to trace back to the transaction's actual start pointer; in
+ *  that case we can execute the undo actions from the current log only,
+ *  because the undo pointer in the slot will be rewound, and that is enough
+ *  to avoid executing the same actions again.  However, it is possible that
+ *  after the undo actions have been executed that undo gets discarded; later,
+ *  while processing the previous log, we might then try to fetch an undo
+ *  record in the discarded log while chasing the transaction header chain.
+ *  To avoid this, we first check whether the transaction's next_urec has
+ *  already been discarded; if so, there is no need to access it, and we start
+ *  executing from the last undo record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads into the
+ *  next log; otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to support more than the default number of prepared undo
+ * records, the limit can be increased by calling UndoSetPrepareSize.  In that
+ * case dynamic memory is allocated, and prepared_undo and undo_buffer start
+ * pointing to the newly allocated memory, which will be released by
+ * UnlockReleaseUndoBuffers; these variables are then set back to their
+ * default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl *log,
+							  UndoRecPtr prev_xact_urp);
+
+/*
+ * Check whether the undo record has been discarded: return false if it has
+ * already been discarded, otherwise return true.
+ *
+ * Caller must hold log->discard_lock.  If this function returns false, the
+ * lock will have been released; otherwise the lock is still held on return
+ * and the caller must release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	Assert(LWLockHeldByMeInMode(log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the DiscardWorker first time
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on this value to
+		 * identify whether the undo record pointer has already been discarded;
+		 * instead we check it by calling the undo log routine.  If it is not
+		 * yet discarded then we reacquire log->discard_lock so that it cannot
+		 * be discarded concurrently.
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * The absence of a previous transaction's undo indicates that this backend
+	 * is preparing its first undo, in which case we have nothing to update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transactions start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in undo buffer array, if it's present then just return
+ * its index otherwise search the buffer and insert an entry and lock the buffer
+ * in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because
+		 * undo_buffer might hold undo from different undo logs (e.g. when
+		 * the previous transaction's start header is in the previous undo
+		 * log), so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate it in bulk.
+ * This is needed for operations that insert multiple undo records in one WAL
+ * operation, e.g. multi-insert.  If we don't allocate undo space for all the
+ * records (which are inserted under one WAL record) together, then some of
+ * them could end up in a different undo log.  And, currently during recovery,
+ * we have no mechanism to map an xid to multiple log numbers for a single WAL
+ * operation.  So, in short, all the records under one WAL record must allocate
+ * their undo from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set the
+	 * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+	 * the space for the transaction header and the valid value of the uur_next
+	 * will be updated while preparing the first undo record of the next
+	 * transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, Fetch database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables as these are used only
+			 * with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log and not the first record of
+	 * the transaction i.e. same transaction continued from the previous log
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If transaction id is switched then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the value of how many maximum prepared can
+ * be done before inserting the prepared undo.  If size is > MAX_PREPARED_UNDO
+ * then it will allocate extra memory to hold the extra prepared undo.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's
+	 * starting undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX another option is that, instead of adding the information to
+	 * this record, we could prepare a new record which contains only the
+	 * transaction information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery we always expect a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores
+		 * mappings for top-level transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* The undo record cannot fit into this block, so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll all be new, so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
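
Taken together with the comments above, a minimal caller sketch might look like the following (illustrative only, not part of the patch; UNDO_PERMANENT is assumed to be one of the UndoPersistence values, and WAL logging is reduced to a comment):

    #include "postgres.h"
    #include "access/undoinsert.h"
    #include "miscadmin.h"		/* START_CRIT_SECTION / END_CRIT_SECTION */

    /* Hypothetical example: write one undo record outside recovery. */
    static void
    example_insert_undo(UnpackedUndoRecord *urec)
    {
    	UndoRecPtr	urecptr;

    	/* Pin and lock the needed undo buffers; this can fail, so no critical section yet. */
    	urecptr = PrepareUndoInsert(urec, UNDO_PERMANENT, InvalidTransactionId);

    	START_CRIT_SECTION();

    	/* Write the prepared record into the already pinned and locked buffers. */
    	InsertPreparedUndo();

    	/*
    	 * A real caller would stash urecptr in the data it WAL-logs; here we
    	 * only silence the unused-variable warning.
    	 */
    	(void) urecptr;

    	END_CRIT_SECTION();

    	/* Unlock and unpin the undo buffers. */
    	UnlockReleaseUndoBuffers();
    }
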
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in the page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page. If it doesn't
+			 * succeed then call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting the
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * The max_prepare_undo limit was changed, so free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  This function will not release the pin on
+ * the buffer if the complete record is fetched from one buffer, so the caller
+ * can reuse the same urec to fetch another undo record from the same block.
+ * The caller is responsible for releasing the buffer inside urec and setting
+ * it to invalid before fetching a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch just the header first and,
+		 * only if it matches the block number and offset, fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and also set an invalid buffer in the undo record.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for the given blkno, offset and transaction id
+ * (if valid).  We need to match the transaction id along with the block
+ * number and offset because in some cases (like reuse of a slot for a
+ * committed transaction), we need to skip the record if it was modified by a
+ * transaction later than the transaction indicated by the previous undo
+ * record.  For example, consider a case where tuple (ctid - 0,1) is modified
+ * by transaction id 500 which belongs to transaction slot 0.  Then, the same
+ * tuple is modified by transaction id 501 which belongs to transaction slot
+ * 1.  Then, both the transaction slots are marked for reuse.  Then, again the
+ * same tuple is modified by transaction id 502 which has reused slot 0.  Now,
+ * a transaction which started before transaction 500 and wants to traverse
+ * the chain to find a visible tuple would keep rotating infinitely between
+ * the undo tuples written by 502 and 501.  In such a case, we need to skip
+ * the undo tuple written by transaction 502 when we want to find the undo
+ * record indicated by the previous pointer of the undo tuple written by
+ * transaction 501.  Start the search from urp.  The caller needs to call
+ * UndoRecordRelease to release the resources allocated by this function.
+ * (A usage sketch follows the function body below.)
+ *
+ * *urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record if a valid pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned then keep it only if we want to
+		 * find the next record in the same block.  Otherwise release the
+		 * buffer and set it invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * UndoLogIsDiscarded, and if it's already discarded then we have
+			 * nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
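
To make the callback-driven lookup above concrete, here is a hypothetical sketch (not part of the patch; the matching rule and both function names are invented for the example):

    /* Hypothetical callback: accept the record that describes our tuple. */
    static bool
    example_rec_matches(UnpackedUndoRecord *urec, BlockNumber blkno,
    					OffsetNumber offset, TransactionId xid)
    {
    	return urec->uur_block == blkno && urec->uur_offset == offset;
    }

    /* Hypothetical caller: walk the undo chain starting at start_urp. */
    static void
    example_lookup(UndoRecPtr start_urp, BlockNumber blkno, OffsetNumber offset)
    {
    	UndoRecPtr	urec_ptr;
    	UnpackedUndoRecord *urec;

    	urec = UndoFetchRecord(start_urp, blkno, offset, InvalidTransactionId,
    						   &urec_ptr, example_rec_matches);
    	if (urec != NULL)			/* NULL means the chain was already discarded */
    	{
    		/* ... inspect urec->uur_xid, urec->uur_payload, etc. ... */
    		UndoRecordRelease(urec);
    	}
    }
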
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..33bb153
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,459 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as
+ * sizeof(PageHeaderData).  (A reader sketch follows the function body below.)
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate the memory.
+		 *
+		 * XXX There is a possible optimization: instead of always allocating
+		 * memory whenever the record is split, we could check whether the
+		 * payload or tuple data falls entirely within the same page and avoid
+		 * allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
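
To make the multi-call protocol concrete, here is a hypothetical reader loop (not part of the patch); get_undo_page() stands in for whatever buffer access the caller performs, for example ReadBufferWithoutRelcache() plus BufferGetPage() as in UndoGetOneRecord():

    /* Hypothetical sketch: unpack one undo record that may span pages. */
    static void
    example_unpack(BlockNumber blk, int starting_byte, UnpackedUndoRecord *uur)
    {
    	int			already_decoded = 0;

    	for (;;)
    	{
    		Page		page = get_undo_page(blk);	/* hypothetical helper */

    		if (UnpackUndoRecord(uur, page, starting_byte, &already_decoded, false))
    			break;				/* the whole record has been decoded */

    		/* The record continues on the next page, just after its header. */
    		blk++;
    		starting_byte = UndoLogBlockHeaderSize;
    	}
    }
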
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number of
+ * bytes to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is set to true then the readlen bytes are skipped over in the
+ * undo but are not copied into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..a2bf7cc
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.  This should be done before any critical section is established,
+ * since it can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * the undo log only stores mappings for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+					TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting the
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for the given blkno and offset.  Start the search
+ * from urp.  The caller needs to call UndoRecordRelease to release the resources
+ * allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If max_prepare is > MAX_PREPARED_UNDO
+ * then extra memory is allocated to hold the extra prepared undo records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..85642ad
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,216 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id*/
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In the
+	 * future it could also be used to show how much undo has been applied
+	 * so far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

0004-undo-interface-test-v4.patchapplication/octet-stream; name=0004-undo-interface-test-v4.patchDownload
From 8f5653c3c39bc7bea3f685b16ee386a036991109 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 00:09:04 -0800
Subject: [PATCH] Provide an interface to prepare, insert, or fetch undo
 records. This layer will use undo-log-storage to reserve space for the
 undo records and buffer management routines to write and read the undo
 records.

Dilip Kumar with help from Rafia Sabia based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
---
 src/backend/access/transam/xact.c    |   24 +
 src/backend/access/transam/xlog.c    |   29 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1172 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  459 +++++++++++++
 src/include/access/undoinsert.h      |  106 +++
 src/include/access/undorecord.h      |  216 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2010 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..6b7f7fa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -189,6 +189,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +916,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..23f23e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * The xid can be on either side of ckptXid when near wrap-around.  If it
+	 * is numerically less than ckptXid but logically follows it, it must have
+	 * wrapped into the next epoch.  OTOH, if it is numerically greater but
+	 * logically precedes ckptXid, then it belongs to the previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..4214771
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1172 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored in sequential order in the undo log.  Each
+ *  transaction's first undo record (a.k.a. the transaction header) points to
+ *  the next transaction's start header.  Transaction headers are linked so
+ *  that the discard worker can read the undo log transaction by transaction
+ *  and avoid reading each undo record.
+ *
+ * Handling multiple logs:
+ *
+ *  It is possible for a transaction's undo records to be spread across
+ *  multiple undo logs, and we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.
+ *
+ *  If the undo record goes to the next log then we insert a transaction
+ *  header for the first record in the new log and update the transaction
+ *  header with this new log's location.  This allows us to connect
+ *  transactions across logs when the same transaction spans multiple logs
+ *  (for this we keep track of the previous logno in the undo log meta-data),
+ *  which is required to find the latest undo record pointer of the aborted
+ *  transaction for executing the undo actions before discard.  If the next
+ *  log gets processed first, we don't need to trace back the actual start
+ *  pointer of the transaction; in that case we can execute the undo actions
+ *  only from the current log, because the undo pointer in the slot will be
+ *  rewound and that is enough to avoid executing the same actions again.
+ *  However, there is a possibility that after executing the undo actions the
+ *  undo pointer gets discarded; at a later stage, while processing the
+ *  previous log, we might try to fetch an undo record in the discarded log
+ *  while chasing the transaction header chain.  To avoid this situation we
+ *  first check whether the next_urec of the transaction is already discarded;
+ *  if so, there is no need to access it, and we start executing from the last
+ *  undo record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads to the
+ *  next log; otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * FIXME:  Do we want to support an undo tuple size which is more than BLCKSZ?
+ * If not, an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Also consider the buffers needed for updating the previous transaction's
+ * starting undo record; hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       ((MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO)
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.
+ * If the caller wants to support more than the default number of prepared
+ * undo records, the limit can be increased by calling UndoSetPrepareSize.
+ * In that case dynamic memory is allocated, and prepared_undo and undo_buffer
+ * start pointing to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, at which point these variables are set back to
+ * their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl *log,
+							  UndoRecPtr prev_xact_urp);
+
+/*
+ * Check whether the undo record has been discarded.  If it's already
+ * discarded return false, otherwise return true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock if
+ * it returns false; otherwise the lock is still held on return and the caller
+ * needs to release it.  (A caller sketch follows the function body below.)
+ */
+static bool
+UndoRecordIsValid(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on this value to
+		 * identify whether the undo record pointer is already discarded;
+		 * instead we check by calling the undo log routine.  If it's not yet
+		 * discarded then we have to reacquire log->discard_lock so that the
+		 * record doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this backend
+	 * is preparing its first undo, in which case we have nothing to update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just writes out the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside a critical section,
+ * and it only overwrites the undo record header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker can't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present then just
+ * return its index; otherwise read the buffer, insert an entry into the array
+ * and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because
+		 * undo_buffer might hold undo from different undo logs (e.g. when
+		 * the previous transaction's start header is in the previous undo
+		 * log), so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so allocate a buffer and insert it into the
+	 * undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate space for them
+ * in bulk.  This is needed for operations that can write multiple undo
+ * records under a single WAL record, e.g. multi-insert.  If we don't allocate
+ * undo space for all such records together, they might end up in different
+ * undo logs, and currently during recovery we have no mechanism to map an
+ * xid to multiple log numbers for one WAL operation.  So, in short, all the
+ * records under one WAL record must allocate their undo from the same undo
+ * log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set
+	 * uur_next to SpecialUndoRecPtr.  This indicates that space should be
+	 * allocated for the transaction header; the valid value of uur_next
+	 * will be filled in while preparing the first undo record of the next
+	 * transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, fetch the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables as they are used only
+			 * with the first record of the transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log but not the first record of the
+	 * transaction, i.e. the same transaction continued from the previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If the transaction id has changed then update the previous
+	 * transaction's start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before they are inserted.  If max_prepare is > MAX_PREPARED_UNDO
+ * then extra memory is allocated to hold the additional prepared undo records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Also consider the buffers needed for updating the previous
+	 * transaction's starting undo record; hence the + 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * the undo log only stores a mapping for the topmost transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX another option is that instead of adding the information to this
+	 * record we could prepare a new record which only contains the
+	 * transaction information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* We expect that during recovery we always have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores a
+		 * mapping for the topmost transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
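+	/*
+	 * Pin and lock one buffer for each page that the record will touch.  As
+	 * a worked example (illustration only, assuming BLCKSZ = 8192 and a page
+	 * header of 24 bytes): a 100 byte record starting at byte 8150 of its
+	 * first page has only 8192 - 8150 = 42 usable bytes there, so a second
+	 * page is needed; that page adds 8192 - 24 = 8168 usable bytes, which
+	 * covers the rest of the record, so two buffers are pinned in total.
+	 */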
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* The undo record cannot fit into this block, so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll all be new, so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in the page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit completely, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for the transaction.  This is required
+		 * to perform rollback when the transaction aborts.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepare_undo limit was changed then free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  It does not release the pin on the buffer
+ * if the complete record is fetched from one buffer, so the caller can reuse
+ * the same urec to fetch another undo record from the same block.  The caller
+ * is responsible for releasing the buffer inside urec and setting it to
+ * invalid if it wishes to fetch a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pinned, there is no need to read a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch only the header first, and
+		 * fetch the complete record only if the block number and offset
+		 * match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and also set the buffer in the undo record to invalid.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp.  Caller need to call UndoRecordRelease to release the
+ * resources allocated by this function.
+ *
+ * *urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record, if a valid pointer is passed in.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned then check whether the next tuple
+		 * we want is from the same block.  If not, release the buffer and set
+		 * it to invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * have copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * UndoLogIsDiscarded; if the record is already discarded then we
+			 * have nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..33bb153
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,459 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
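+ *
+ * Worked example (illustration only): if this structure is 40 bytes,
+ * *my_bytes_written is 10 (the first 10 bytes went onto the previous page)
+ * and only 20 bytes remain before 'endptr', then we copy bytes 10..29 of the
+ * source, advance *writeptr and *total_bytes_written by 20, reset
+ * *my_bytes_written to 0, and return false so that the caller continues with
+ * the remaining 10 bytes on the next page.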
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode relation details (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX A possible optimization: instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload or
+		 * the tuple data falls entirely within one page and avoid allocating
+		 * memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number of
+ * bytes to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * 'nocopy': if this flag is set to true then we just skip over readlen bytes
+ * of undo data without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..a2bf7cc
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
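+/*
+ * As an illustration only (not part of this interface), a callback for a
+ * hypothetical access method could look roughly like this; it accepts a
+ * record only if it touches the given block and offset and was written by
+ * the given transaction:
+ *
+ *	static bool
+ *	sample_rec_matches(UnpackedUndoRecord *urec, BlockNumber blkno,
+ *					   OffsetNumber offset, TransactionId xid)
+ *	{
+ *		return urec->uur_block == blkno &&
+ *			   urec->uur_offset == offset &&
+ *			   urec->uur_xid == xid;
+ *	}
+ */
+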
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+					TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
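+/*
+ * Putting the pieces above together, an access method would typically drive
+ * this interface roughly as follows.  This is an illustrative sketch only:
+ * the AM-specific WAL record contents are omitted, and 'persistence',
+ * 'first_block_id', 'rmid' and 'info' stand for whatever values the caller
+ * is using.
+ *
+ *	urecptr = PrepareUndoInsert(&undorecord, persistence, InvalidTransactionId);
+ *	XLogBeginInsert();
+ *	... register AM-specific WAL data and buffers ...
+ *	START_CRIT_SECTION();
+ *	InsertPreparedUndo();
+ *	RegisterUndoLogBuffers(first_block_id);
+ *	recptr = XLogInsert(rmid, info);
+ *	UndoLogBuffersSetLSN(recptr);
+ *	END_CRIT_SECTION();
+ *	UnlockReleaseUndoBuffers();
+ */
+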
+/*
+ * Fetch the next undo record for given blkno and offset.  Start the search
+ * from urp.  The caller needs to call UndoRecordRelease to release the resources
+ * allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before they are inserted.  If max_prepare is > MAX_PREPARED_UNDO
+ * then extra memory is allocated to hold the additional prepared undo records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..85642ad
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,216 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without any padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id*/
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In the
+	 * future it could also be used to show how much of the undo has been
+	 * applied so far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
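+/*
+ * For illustration only (mirroring the insertion loop in undoinsert.c), a
+ * caller whose record may cross a page boundary drives InsertUndoRecord
+ * roughly like this, where get_next_undo_page() stands for whatever the
+ * caller uses to obtain the page that follows:
+ *
+ *	int already_written = 0;
+ *
+ *	while (!InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+ *	{
+ *		page = get_next_undo_page();
+ *		starting_byte = sizeof(PageHeaderData);
+ *	}
+ */
+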
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
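/*
 * The read side mirrors that loop; again undo_buffer_for_block() is a
 * hypothetical stand-in for however the caller obtains the pinned buffer.
 */
static void
unpack_record_sketch(UnpackedUndoRecord *uur, UndoRecPtr urp)
{
	BlockNumber blk = UndoRecPtrGetBlockNum(urp);
	int			starting_byte = UndoRecPtrGetPageOffset(urp);
	int			already_decoded = 0;

	while (!UnpackUndoRecord(uur, BufferGetPage(undo_buffer_for_block(blk)),
							 starting_byte, &already_decoded, false))
	{
		/* The record continues on the next page, just after its header. */
		starting_byte = SizeOfPageHeaderData;
		blk++;
	}
}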
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#16Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#15)
Re: Undo logs

On Wed, Nov 14, 2018 at 2:42 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Nov 10, 2018 at 9:27 AM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Mon, Nov 5, 2018 at 5:10 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

[review for undo record layer (0003-undo-interface-v3)]

I might sound like I'm repeating myself, but just to be clear: I was involved
in the design of this patch as well and have given a few high-level
inputs for it. I have used this interface in the zheap
development, but hadn't done any sort of detailed review, which I am
doing now. I encourage others to review this patch as well.

Thanks for the review, please find my reply inline.

1.
* NOTES:
+ * Handling multilog -
+ *  It is possible that the undo record of a transaction can be spread across
+ *  multiple undo log.  And, we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.

I think before describing how the undo record is spread across
multiple logs, you can explain how it is laid out when that is not the
case. You can also explain how undo record headers are linked. I am
not sure whether the file header is the best place or whether it should be
mentioned in the README, but I think for now we can use the file header for
this purpose and later move it to the README if required.

Added in the header.

2.
+/*
+ * FIXME:  Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */

+#define MAX_BUFFER_PER_UNDO 2

I think the right question here is: how likely is an undo record to be
greater than BLCKSZ? For zheap, as of today, we don't have any such
requirement, as the largest undo records are written for update and
multi_insert, and in both cases we don't exceed the limit of BLCKSZ. I
guess some user other than zheap could have such a requirement, and I
don't think it would be impossible to enhance this if the need arises.

If anybody else has an opinion here, please feel free to share it.

Should we remove this FIXME, or should we wait for other opinions? As
of now I have kept it as it is.

I think you can keep it with XXX instead of Fixme as there is nothing to fix.

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#16)
2 attachment(s)
Re: Undo logs

On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can keep it with XXX instead of Fixme as there is nothing to fix.

Changed

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

My bad, please find the updated patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0004-undo-interface-test-v5.patchapplication/octet-stream; name=0004-undo-interface-test-v5.patchDownload
From 11f184a99e385e72b5f900e34d0ae288dcc58fcf Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 02:05:23 -0800
Subject: [PATCH] undo-interface-test-v4

Provides a test module for the undo-interface routines.
---
 src/test/modules/Makefile                          |  1 +
 src/test/modules/test_undo_api/Makefile            | 21 ++++++
 .../test_undo_api/expected/test_undo_api.out       | 12 +++
 .../modules/test_undo_api/sql/test_undo_api.sql    |  8 ++
 .../modules/test_undo_api/test_undo_api--1.0.sql   |  8 ++
 src/test/modules/test_undo_api/test_undo_api.c     | 85 ++++++++++++++++++++++
 .../modules/test_undo_api/test_undo_api.control    |  4 +
 7 files changed, 139 insertions(+)
 create mode 100644 src/test/modules/test_undo_api/Makefile
 create mode 100644 src/test/modules/test_undo_api/expected/test_undo_api.out
 create mode 100644 src/test/modules/test_undo_api/sql/test_undo_api.sql
 create mode 100644 src/test/modules/test_undo_api/test_undo_api--1.0.sql
 create mode 100644 src/test/modules/test_undo_api/test_undo_api.c
 create mode 100644 src/test/modules/test_undo_api/test_undo_api.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 43323a6..e05fd00 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_undo \
+		  test_undo_api \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/test_undo_api/Makefile b/src/test/modules/test_undo_api/Makefile
new file mode 100644
index 0000000..deb3816
--- /dev/null
+++ b/src/test/modules/test_undo_api/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_undo_api/Makefile
+
+MODULE_big = test_undo_api
+OBJS = test_undo_api.o
+PGFILEDESC = "test_undo_api - a test module for the undo api layer"
+
+EXTENSION = test_undo_api
+DATA = test_undo_api--1.0.sql
+
+REGRESS = test_undo_api
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_undo_api
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_undo_api/expected/test_undo_api.out b/src/test/modules/test_undo_api/expected/test_undo_api.out
new file mode 100644
index 0000000..995b517
--- /dev/null
+++ b/src/test/modules/test_undo_api/expected/test_undo_api.out
@@ -0,0 +1,12 @@
+CREATE EXTENSION test_undo_api;
+--
+-- This test inserts data into the undo log using the undo API, then fetches
+-- it back and verifies that the same data is returned.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
+ test_undo_api 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_undo_api/sql/test_undo_api.sql b/src/test/modules/test_undo_api/sql/test_undo_api.sql
new file mode 100644
index 0000000..4fb40ff
--- /dev/null
+++ b/src/test/modules/test_undo_api/sql/test_undo_api.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION test_undo_api;
+
+--
+-- This test inserts data into the undo log using the undo API, then fetches
+-- it back and verifies that the same data is returned.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
diff --git a/src/test/modules/test_undo_api/test_undo_api--1.0.sql b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
new file mode 100644
index 0000000..3dd134b
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
@@ -0,0 +1,8 @@
+\echo Use "CREATE EXTENSION test_undo_api" to load this file. \quit
+
+CREATE FUNCTION test_undo_api(xid xid, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_undo_api/test_undo_api.c b/src/test/modules/test_undo_api/test_undo_api.c
new file mode 100644
index 0000000..d8f3a06
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.c
@@ -0,0 +1,85 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_class.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_undo_api);
+
+static UndoPersistence
+undo_persistence_from_text(text *t)
+{
+	char *str = text_to_cstring(t);
+
+	if (strcmp(str, "permanent") == 0)
+		return UNDO_PERMANENT;
+	else if (strcmp(str, "temporary") == 0)
+		return UNDO_TEMP;
+	else if (strcmp(str, "unlogged") == 0)
+		return UNDO_UNLOGGED;
+	else
+		elog(ERROR, "unknown undo persistence level: %s", str);
+}
+
+/*
+ * Prepare and insert data in undo storage and fetch it back to verify.
+ */
+Datum
+test_undo_api(PG_FUNCTION_ARGS)
+{
+	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	char	*data = "test_data";
+	int		 len = strlen(data);
+	UnpackedUndoRecord	undorecord;
+	UnpackedUndoRecord *undorecord_out;
+	int	header_size = offsetof(UnpackedUndoRecord, uur_next) + sizeof(uint64);
+	UndoRecPtr	undo_ptr;
+
+	undorecord.uur_type = 0;
+	undorecord.uur_info = 0;
+	undorecord.uur_prevlen = 0;
+	undorecord.uur_prevxid = FrozenTransactionId;
+	undorecord.uur_xid = xid;
+	undorecord.uur_cid = 0;
+	undorecord.uur_tsid = 100;
+	undorecord.uur_fork = MAIN_FORKNUM;
+	undorecord.uur_blkprev = 0;
+	undorecord.uur_block = 1;
+	undorecord.uur_offset = 100;
+	undorecord.uur_payload.len = 0;
+	initStringInfo(&undorecord.uur_tuple);
+	
+	appendBinaryStringInfo(&undorecord.uur_tuple,
+						   (char *) data,
+						   len);
+	undo_ptr = PrepareUndoInsert(&undorecord, persistence, xid);
+	InsertPreparedUndo();
+	UnlockReleaseUndoBuffers();
+	
+	undorecord_out = UndoFetchRecord(undo_ptr, InvalidBlockNumber,
+									 InvalidOffsetNumber,
+									 InvalidTransactionId, NULL,
+									 NULL);
+
+	if (strncmp((char *) &undorecord, (char *) undorecord_out, header_size) != 0)
+		elog(ERROR, "undo header did not match");
+	if (strncmp(undorecord_out->uur_tuple.data, data, len) != 0)
+		elog(ERROR, "undo data did not match");
+
+	UndoRecordRelease(undorecord_out);
+	pfree(undorecord.uur_tuple.data);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_undo_api/test_undo_api.control b/src/test/modules/test_undo_api/test_undo_api.control
new file mode 100644
index 0000000..09df344
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.control
@@ -0,0 +1,4 @@
+comment = 'test_undo_api'
+default_version = '1.0'
+module_pathname = '$libdir/test_undo_api'
+relocatable = true
-- 
1.8.3.1

0003-undo-interface-v5.patchapplication/octet-stream; name=0003-undo-interface-v5.patchDownload
From e5630549ed64a687551ca0830eb2b1514a6a1e45 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 01:17:03 -0800
Subject: [PATCH] undo-interface-v3

Provide an interface to prepare, insert, and fetch undo
records. This layer uses the undo-log-storage layer to reserve space for
undo records and the buffer management routines to write and read the
undo records.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
---
 src/backend/access/transam/xact.c    |   24 +
 src/backend/access/transam/xlog.c    |   29 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1172 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  459 +++++++++++++
 src/include/access/undoinsert.h      |  106 +++
 src/include/access/undorecord.h      |  216 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2010 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..6b7f7fa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -189,6 +189,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +916,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..23f23e7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,35 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * The xid can be on either side of ckptXid when near wrap-around.  If the
+	 * xid is numerically greater but logically precedes ckptXid, it belongs
+	 * to the previous epoch.  OTOH, if it is numerically less but logically
+	 * follows ckptXid, it must have wrapped into the next epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
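/*
 * A worked example of the adjustment above, with illustrative numbers only:
 * if the last checkpoint recorded ckptXid = 100 in epoch 5, then an xid of
 * 4000000000 is numerically greater than 100 but, by the modulo-2^32
 * comparison, logically precedes it, so it must have been assigned before
 * the wraparound and GetEpochForXid returns epoch 4.
 */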
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..4cfc58d
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1172 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored sequentially in the undo log.  Each transaction's
+ *  first undo record (the "transaction header") points to the start header of
+ *  the next transaction.  Transaction headers are linked so that the discard
+ *  worker can process the undo log transaction by transaction without reading
+ *  every undo record.
+ *
+ * Handling multi log:
+ *
+ *  The undo records of a single transaction can be spread across multiple
+ *  undo logs, and we need some special handling while inserting undo for
+ *  discard and rollback to work sanely.
+ *
+ *  If an undo record goes to the next log, we insert a transaction header for
+ *  the first record in the new log and update the transaction's header in the
+ *  previous log with the new log's location.  This lets us connect the parts
+ *  of a transaction that span logs (for this we keep track of the previous
+ *  logno in the undo log metadata), which is required to find the latest undo
+ *  record pointer of an aborted transaction when executing its undo actions
+ *  before discard.  If the next log gets processed first, we don't need to
+ *  trace back to the actual start pointer of the transaction; in that case we
+ *  can execute the undo actions from the current log only, because the undo
+ *  pointer in the slot will have been rewound, and that is enough to avoid
+ *  executing the same actions twice.  However, it is possible that after the
+ *  undo actions have been executed the undo pointer gets discarded, and at a
+ *  later stage, while processing the previous log, we might try to fetch an
+ *  undo record in the discarded log while chasing the transaction header
+ *  chain.  To avoid this we first check whether the transaction's next_urec
+ *  has already been discarded; if so, we do not access it and instead start
+ *  executing from the last undo record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads into it;
+ *  otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
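/*
 * To make the transaction header chain concrete, here is a sketch of how a
 * reader such as the discard worker might hop from header to header using
 * uur_next, relying on UndoFetchRecord/UndoRecordRelease as defined later in
 * this file (first_txn_header stands in for wherever the reader starts):
 *
 *     UndoRecPtr urp = first_txn_header;
 *
 *     while (UndoRecPtrIsValid(urp) && urp != SpecialUndoRecPtr)
 *     {
 *         UnpackedUndoRecord *uur = UndoFetchRecord(urp, InvalidBlockNumber,
 *                                                   InvalidOffsetNumber,
 *                                                   InvalidTransactionId,
 *                                                   NULL, NULL);
 *
 *         if (uur == NULL)
 *             break;
 *         urp = uur->uur_next;
 *         UndoRecordRelease(uur);
 *     }
 *
 * A NULL return means the record has already been discarded; uur_next takes
 * the reader straight to the next transaction's start header without decoding
 * the records in between.
 */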
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support undo tuples larger than BLCKSZ?  If not, an undo
+ * record can spread across at most 2 buffers.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider the buffers needed for updating the previous transaction's
+ * starting undo record as well; hence the +1.
+ */
+#define MAX_UNDO_BUFFERS       ((MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO)
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid
+ * differs from prev_txid; if so, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to prepare more than the default maximum number of undo
+ * records, the limit can be raised by calling UndoSetPrepareSize.  In that
+ * case dynamic memory is allocated and prepared_undo and undo_buffer start
+ * pointing to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, at which point these variables are set back to
+ * their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl *log,
+							  UndoRecPtr prev_xact_urp);
+
+/*
+ * Check whether the undo record has been discarded.  Returns false if it is
+ * already discarded and true otherwise.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock
+ * when it returns false; otherwise the lock is still held on return and the
+ * caller needs to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	Assert(LWLockHeldByMeInMode(log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on this value to
+		 * tell whether the undo record pointer has already been discarded;
+		 * instead we check it via the undo log routine.  If it has not yet
+		 * been discarded, we reacquire log->discard_lock so that the record
+		 * doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo inside the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * The absence of a previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case there is nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just writes out the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside the critical section,
+ * and it overwrites only the undo header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it is present, just
+ * return its index; otherwise read the buffer, lock it in exclusive mode and
+ * insert an entry for it.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to compare just the block number, because
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords undo records and allocate the
+ * space in bulk.  This is required for operations that allocate multiple undo
+ * records under one WAL record, e.g. multi-insert.  If we did not allocate the
+ * undo space for all the records covered by one WAL record together, some of
+ * them could end up in different undo logs, and currently during recovery we
+ * have no mechanism to map an xid to multiple log numbers within one WAL
+ * operation.  In short, all the undo records belonging to one WAL record must
+ * be allocated from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set the
+	 * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+	 * the space for the transaction header and the valid value of the uur_next
+	 * will be updated while preparing the first undo record of the next
+	 * transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, Fetch database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables as these are used only
+			 * with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log and not the first record of
+	 * the transaction i.e. same transaction continued from the previous log
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If transaction id is switched then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before the prepared undo is inserted.  If the requested size is
+ * greater than MAX_PREPARED_UNDO, extra memory is allocated to hold the extra
+ * prepared undo records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider the buffers needed for updating the previous transaction's
+	 * starting undo record as well; hence the +1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id, because
+ * the undo log only stores mappings for topmost transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in the WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX Alternatively, instead of adding the information to this record we
+	 * could prepare a new record that contains only the transaction
+	 * information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* we expect during recovery, we always have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores
+		 * mappings for topmost transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* Undo record can not fit into this block so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit completely, loop around and continue with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block, count the block header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * The max_prepare_undo limit was raised, so free the allocated memory and
+	 * reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
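/*
 * A sketch of one complete insertion cycle as an access method might drive it
 * with the routines above.  The caller's own WAL payload is represented here
 * by placeholder arguments (rmid, info, redo_data, redo_len); the accompanying
 * test module shows the simpler pattern without WAL.
 */
static UndoRecPtr
undo_insert_cycle_sketch(UnpackedUndoRecord *uur, RmgrId rmid, uint8 info,
						 char *redo_data, int redo_len)
{
	UndoRecPtr	urp;
	XLogRecPtr	lsn;

	/* Reserve undo space and pin/lock the needed undo buffers; this can fail. */
	urp = PrepareUndoInsert(uur, UNDO_PERMANENT, InvalidTransactionId);

	START_CRIT_SECTION();

	/* Copy the prepared record into the locked undo buffers. */
	InsertPreparedUndo();

	/* Log the caller's change together with the dirtied undo buffers. */
	XLogBeginInsert();
	XLogRegisterData(redo_data, redo_len);
	RegisterUndoLogBuffers(1);	/* block id 0 is left for the caller's own page */
	lsn = XLogInsert(rmid, info);
	UndoLogBuffersSetLSN(lsn);

	END_CRIT_SECTION();

	/* Unlock and unpin the undo buffers once outside the critical section. */
	UnlockReleaseUndoBuffers();

	return urp;
}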
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  If the complete record is fetched from a
+ * single buffer, the pin on that buffer is not released, so the caller can
+ * reuse the same urec to fetch another undo record that is on the same block.
+ * The caller is responsible for releasing the buffer stored in urec, and for
+ * setting it to invalid, before fetching a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This could be optimized to fetch only the header first, and
+		 * fetch the complete record only if the block number and offset
+		 * match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * Complete record is not fitting into one buffer so release the buffer
+		 * pin and also set invalid buffer in the undo record.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for the given blkno, offset and transaction id
+ * (if valid).  We need to match the transaction id along with the block
+ * number and offset because in some cases (like reuse of a slot by a
+ * committed transaction) we need to skip a record if it was modified by a
+ * transaction later than the one indicated by the previous undo record.  For
+ * example, consider a case where tuple (ctid 0,1) is modified by transaction
+ * id 500, which belongs to transaction slot 0.  Then the same tuple is
+ * modified by transaction id 501, which belongs to transaction slot 1.  Then
+ * both transaction slots are marked for reuse.  Then the same tuple is
+ * modified again, by transaction id 502, which has used slot 0.  Now, a
+ * transaction that started before transaction 500 and wants to traverse the
+ * chain to find the visible tuple would rotate infinitely between the undo
+ * tuples written by 502 and 501.  In such a case we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of the undo tuple written by transaction
+ * 501.  The search starts from urp.  The caller needs to call
+ * UndoRecordRelease to release the resources allocated by this function.
+ *
+ * *urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record if a valid pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned then just ensure that we want to
+		 * find the next tuple from the same block.  Otherwise release the
+		 * buffer and set it invalid
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized, so we have to check
+			 * UndoLogIsDiscarded; if it's already discarded then we have
+			 * nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
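/*
 * A sketch of how a caller might chase an undo chain with UndoFetchRecord.
 * The callback below assumes SatisfyUndoRecordCallback has the signature the
 * call site above implies (urec, blkno, offset, xid -> bool); the matching
 * rule it applies is purely illustrative, and a real caller would also
 * consult xid as described in the comment above.
 */
static bool
sample_satisfies_callback(UnpackedUndoRecord *urec, BlockNumber blkno,
						  OffsetNumber offset, TransactionId xid)
{
	/* Illustrative only: match on the block and offset stored in the undo. */
	return urec->uur_block == blkno && urec->uur_offset == offset;
}

static void
fetch_chain_sketch(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset)
{
	UndoRecPtr	matched_urp;	/* reports which record in the chain qualified */
	UnpackedUndoRecord *uur;

	uur = UndoFetchRecord(urp, blkno, offset, InvalidTransactionId,
						  &matched_urp, sample_satisfies_callback);
	if (uur == NULL)
		return;					/* everything of interest was discarded */

	/* ... interpret uur here, e.g. reconstruct the prior tuple version ... */

	UndoRecordRelease(uur);
}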
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..33bb153
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,459 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page, otherwise
+		 * allocate memory and copy them out.
+		 *
+		 * XXX Possible optimization: instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload
+		 * or the tuple data happens to fall entirely within one page and
+		 * avoid allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to read into it.
+ *
+ * 'readptr' points to the read position for these bytes, and is updated
+ * for however much we read.  The read position must not pass 'endptr',
+ * which represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true, the read position is advanced past 'readlen' bytes
+ * but nothing is copied into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..a2bf7cc
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+					TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked by PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting
+ * the critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
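+/*
+ * Illustrative calling sequence (an editorial sketch pieced together from the
+ * comments in this header, not code taken from the patch; the caller's own
+ * rmgr id, WAL data registrations, block id, and error handling are omitted):
+ *
+ *		urecptr = PrepareUndoInsert(&undorecord, persistence, xid);
+ *		START_CRIT_SECTION();
+ *		InsertPreparedUndo();
+ *		XLogBeginInsert();
+ *		... register the caller's own data and buffers ...
+ *		RegisterUndoLogBuffers(first_block_id);
+ *		recptr = XLogInsert(rmid, info);
+ *		UndoLogBuffersSetLSN(recptr);
+ *		END_CRIT_SECTION();
+ *		UnlockReleaseUndoBuffers();
+ */
+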
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for given blkno and offset.  Start the search
+ * from urp.  The caller needs to call UndoRecordRelease to release the resources
+ * allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
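+
+/*
+ * Editorial sketch of a fetch/release cycle (not code from the patch):
+ *
+ *		uur = UndoFetchRecord(urp, blkno, offset, xid, &urec_ptr, callback);
+ *		if (uur != NULL)
+ *		{
+ *			... inspect uur->uur_payload / uur->uur_tuple ...
+ *			UndoRecordRelease(uur);
+ *		}
+ */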
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the requested size is greater than
+ * MAX_PREPARED_UNDO, extra memory is allocated to hold the additional
+ * prepared records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
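+/*
+ * Editorial sketch (not code from the patch): an operation such as a
+ * multi-insert that needs one undo record per tuple might do
+ *
+ *		UndoSetPrepareSize(ntuples, undorecords, xid, persistence);
+ *		for (i = 0; i < ntuples; i++)
+ *			urecptr[i] = PrepareUndoInsert(&undorecords[i], persistence, xid);
+ *
+ * so that all of the records are allocated from the same undo log.
+ */
+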
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..85642ad
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,216 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without any padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
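+
+/*
+ * Editorial example (not part of the patch): an undo record for a change to
+ * a block in the main fork of the default tablespace, carrying payload and
+ * tuple data but no transaction header, gets
+ * uur_info = UREC_INFO_BLOCK | UREC_INFO_PAYLOAD, and UndoRecordExpectedSize()
+ * returns SizeOfUndoRecordHeader + SizeOfUndoRecordBlock +
+ * SizeOfUndoRecordPayload + uur_payload.len + uur_tuple.len.
+ */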
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for the transaction to which this undo belongs.
+ * It also stores the pointer to the next transaction's undo (urec_next).
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In the
+	 * future this could also record how much of the undo has been applied
+	 * so far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
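+/*
+ * Editorial sketch of the caller's loop (not code from the patch; how the
+ * caller pins, locks, and initializes each successive undo page is up to it):
+ *
+ *		int already_written = 0;
+ *
+ *		while (!InsertUndoRecord(uur, page, starting_byte, &already_written,
+ *								 false))
+ *		{
+ *			page = ... the page of the next undo block ...;
+ *			starting_byte = sizeof(PageHeaderData);
+ *		}
+ */
+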
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#18Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#17)
1 attachment(s)
Re: Undo logs

On Wed, Nov 14, 2018 at 3:48 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Nov 14, 2018 at 2:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think you can keep it with XXX instead of Fixme as there is nothing to fix.

Changed

Both the patches 0003-undo-interface-v4.patch and
0004-undo-interface-test-v4.patch appear to be the same except for the
name?

My bad, please find the updated patch.

There was a problem in an assert, and one comment was not aligned
properly, so I have fixed those in the latest patch.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v6.patchapplication/octet-stream; name=0003-undo-interface-v6.patchDownload
From 1718d0e3666884d533abc9519bf7df597b980a52 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 01:17:03 -0800
Subject: [PATCH] undo-interface-v3

Provide an interface to prepare, insert, and fetch undo records.  This
layer uses the undo-log-storage layer to reserve space for the undo
records and the buffer management routines to write and read them.

Dilip Kumar, with help from Rafia Sabih, based on an early prototype
for forming undo records by Robert Haas and design inputs from Amit Kapila.
---
 src/backend/access/transam/xact.c    |   24 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1172 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  459 +++++++++++++
 src/include/access/undoinsert.h      |  106 +++
 src/include/access/undorecord.h      |  216 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2011 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..6b7f7fa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -189,6 +189,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	/* start and latest undo record locations for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +916,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation - remember the start and latest undo record
+ * pointers for the current transaction, per persistence level.
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for the first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..36c161e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side of ckptXid when near wrap-around.  If xid is
+	 * numerically less than ckptXid but logically later, it must have
+	 * wrapped into the next epoch.  OTOH, if it is numerically greater but
+	 * logically earlier, it belongs to the previous epoch.
+	 */
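+	/*
+	 * Editorial example (not from the patch): with ckptXid = 100 in epoch 5,
+	 * an xid of 50 that logically follows ckptXid must have wrapped around,
+	 * so it is in epoch 6; an xid of 4000000000 that logically precedes
+	 * ckptXid is from before the wrap, so it is in epoch 4.
+	 */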
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..935b3ad
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1172 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored sequentially in the undo log.  Each
+ *  transaction's first undo record (the transaction header) points to the
+ *  next transaction's start header.  Transaction headers are linked so that
+ *  the discard worker can walk the undo log transaction by transaction
+ *  instead of reading every undo record.
+ *
+ * Handling multiple logs:
+ *
+ *  The undo records of a single transaction can be spread across multiple
+ *  undo logs, and we need some special handling while inserting undo for
+ *  discard and rollback to work sanely.
+ *
+ *  If an undo record goes to the next log, we insert a transaction header
+ *  for the first record in the new log and update the previous transaction
+ *  header with the new log's location.  This allows us to connect
+ *  transactions across logs when the same transaction spans them (for this
+ *  we keep track of the previous logno in the undo log metadata), which is
+ *  required to find the latest undo record pointer of an aborted
+ *  transaction in order to execute its undo actions before discard.  If the
+ *  next log gets processed first, we don't need to trace back to the actual
+ *  start pointer of the transaction; we can execute the undo actions from
+ *  the current log only, because the undo pointer in the slot will be
+ *  rewound, which is enough to avoid executing the same actions again.
+ *  However, it is possible that after the undo actions are executed the
+ *  undo pointer gets discarded; at a later stage, while processing the
+ *  previous log, we might then try to fetch an undo record in the discarded
+ *  log while chasing the transaction header chain.  To avoid this we first
+ *  check whether the next_urec of the transaction is already discarded; if
+ *  so, we don't access it and instead start executing from the last undo
+ *  record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads into it;
+ *  otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
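+/*
+ * Editorial illustration (not part of the patch): with transactions T1, T2
+ * and T3 writing undo into the same log, the record stream and the chain of
+ * transaction headers look roughly like this:
+ *
+ *   [T1 hdr][T1 recs...]   [T2 hdr][T2 recs...]   [T3 hdr][T3 recs...]
+ *      |                      ^  |                   ^
+ *      +------uur_next--------+  +------uur_next-----+
+ *
+ * i.e. each transaction header's uur_next points at the start of the next
+ * transaction's undo, which is what the discard worker follows.
+ */
+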
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size larger than BLCKSZ?  If not,
+ * an undo record can spread across two buffers at most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Also consider the buffers needed for updating the previous transaction's
+ * starting undo record; hence the extra 1.
+ */
+#define MAX_UNDO_BUFFERS       ((MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO)
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new top-level
+ * transaction tries to prepare an undo record, we check whether its txid
+ * differs from prev_txid; if so, we insert a transaction start record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to prepare more than the default maximum number of undo
+ * records, the limit can be raised by calling UndoSetPrepareSize.  In that
+ * case dynamic memory is allocated and prepared_undo and undo_buffer point
+ * to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, after which these variables are set back to
+ * their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl *log,
+							  UndoRecPtr prev_xact_urp);
+
+/*
+ * Check whether the undo record is discarded or not.  If it is already
+ * discarded, return false; otherwise return true.
+ *
+ * Caller must hold log->discard_lock.  This function releases the lock if it
+ * returns false; otherwise the lock is still held on return and the caller
+ * must release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		 * oldest_data is only initialized the first time the discard worker
+		 * attempts to discard undo logs, so we cannot rely on it to tell
+		 * whether the undo record pointer has already been discarded;
+		 * instead we ask the undo log layer.  If it is not yet discarded, we
+		 * reacquire log->discard_lock so that the record cannot be discarded
+		 * concurrently while we use it.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo log.  This reads the header of the first
+ * undo record of the previous transaction and locks the necessary buffers.
+ * The actual update is done by UndoRecordUpdateTransInfo inside the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case there is nothing
+	 * to update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that the
+	 * discard worker doesn't remove the record while we are reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just inserts the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside a critical section,
+ * and it overwrites only the undo header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that the
+	 * discard worker can't remove the record while we are reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it is present, just
+ * return its index; otherwise read the buffer, lock it in exclusive mode,
+ * and insert an entry for it.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to compare just the block number because the
+		 * undo_buffer array might hold undo from different undo logs (e.g.
+		 * when the previous transaction's start header is in the previous
+		 * undo log), so compare both logno and blkno.
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert it into the
+	 * undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords undo records and allocate
+ * the space in bulk.  This is required for operations that can generate
+ * multiple undo records for one WAL operation, e.g. multi-insert.  If we
+ * don't allocate the undo space for all the records covered by one WAL
+ * record together, they might end up in different undo logs, and currently
+ * during recovery we have no mechanism to map an xid to multiple log numbers
+ * for one WAL operation.  In short, all the undo for one WAL record must be
+ * allocated from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set uur_next
+	 * to SpecialUndoRecPtr.  This indicates that space must be allocated for
+	 * the transaction header; the real value of uur_next will be filled in
+	 * while preparing the first undo record of the next transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, fetch the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables here as they are used
+			 * only with the first record of a transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log but not the first record of
+	 * the transaction, i.e. the same transaction continued from the previous
+	 * log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If the transaction id has changed then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the requested size is greater than
+ * MAX_PREPARED_UNDO, extra memory is allocated to hold the additional
+ * prepared records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Also consider the buffers needed for updating the previous
+	 * transaction's starting undo record; hence the extra 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.  This should be done before any critical section is established,
+ * since it can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id, because
+ * the undo log only stores the mapping for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX Alternatively, instead of adding the information to this record we
+	 * could prepare a new record that contains only the transaction
+	 * information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery we always expect a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores the
+		 * mapping for top-level transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report an error? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* Undo record can not fit into this block so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
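+/*
+ * Register all undo buffers pinned so far with the WAL record that is being
+ * assembled, using consecutive block ids starting at first_block_id.
+ * Buffers for freshly zeroed pages are registered with REGBUF_WILL_INIT.
+ */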
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
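+/*
+ * Stamp the given WAL record pointer as the page LSN of every undo buffer
+ * touched by the prepared undo records.
+ */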
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked by PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we are writing the first record
+			 * on the page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the block
+			 * header in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for the transaction.  This is
+		 * required to perform rollback if the transaction aborts.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting
+ * the critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepare_undo limit was changed, free the allocated memory and
+	 * reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  If the complete record is fetched from a
+ * single buffer, this function does not release the pin on that buffer, so the
+ * caller can reuse the same urec to fetch another undo record from the same
+ * block.  The caller is responsible for releasing the buffer stored inside
+ * urec, and for setting it to invalid before fetching a record from another
+ * block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch only the header first, and
+		 * fetch the complete record only if the block number and offset
+		 * match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_split = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and also set the buffer in the undo record to invalid.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction that started before transaction 500 and wants to traverse the
+ * chain to find a visible tuple will keep rotating infinitely between the undo
+ * tuples written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of the undo tuple written by transaction
+ * 501.  Start the search from urp.  The caller must call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is set to the undo record pointer of the qualified undo record
+ * if a valid pointer is passed in.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we already have a valid buffer pinned, check whether the next
+		 * record we want to fetch is in the same block.  Otherwise release
+		 * the buffer and set it invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * UndoLogIsDiscarded, and if it's already discarded then we have
+			 * nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
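
As a usage illustration (not part of the patch), here is a minimal sketch of
how a caller might drive the prepare/insert protocol implemented above; the
surrounding function, the way the record is filled in, and the use of
UNDO_PERMANENT and of a WAL record are assumptions made purely for the example:

/*
 * Hedged sketch of a caller of the undo insert API above.  Everything in
 * this function body is illustrative; it is not part of the patch.
 */
static UndoRecPtr
example_log_undo(UnpackedUndoRecord *uur)
{
	UndoRecPtr	urecptr;

	/*
	 * Reserve undo space and pin/lock the needed buffers.  This can fail,
	 * so it must happen before any critical section is entered.
	 */
	urecptr = PrepareUndoInsert(uur, UNDO_PERMANENT, GetTopTransactionId());

	START_CRIT_SECTION();

	/* Write the prepared record into the already pinned and locked buffers. */
	InsertPreparedUndo();

	/* ... the matching WAL record would be inserted here ... */

	END_CRIT_SECTION();

	/* Unlock and unpin the undo buffers once the critical section is over. */
	UnlockReleaseUndoBuffers();

	return urecptr;
}
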
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..33bb153
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,459 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_relfilenode = uur->uur_relfilenode;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_tsid = uur->uur_tsid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_relfilenode == uur->uur_relfilenode);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_tsid == uur->uur_tsid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_split = (my_bytes_decoded > 0);
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_relfilenode = work_hdr.urec_relfilenode;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode relation details (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_tsid = work_rd.urec_tsid;
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page, then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory.
+		 *
+		 * XXX A possible optimization: instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload or
+		 * tuple data falls entirely within the same page and avoid allocating
+		 * memory for that part.
+		 */
+		if (!is_undo_split &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number of
+ * bytes to be read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If the 'nocopy' flag is set to true, then we just skip over readlen bytes
+ * of undo data without copying them into the buffer.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_tsid != DEFAULTTABLESPACE_OID ||
+		uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
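
For illustration only, here is a minimal sketch of the InsertUndoRecord()
calling convention described above; get_next_undo_page() is a hypothetical
helper standing in for whatever the caller uses to obtain the page holding the
current insertion point, and nothing below is part of the patch:

/*
 * Hedged sketch of a caller that writes one undo record which may span
 * multiple pages.  get_next_undo_page() is a hypothetical helper.
 */
static void
example_write_record(UnpackedUndoRecord *uur, Page page, int starting_byte)
{
	int		already_written = 0;

	/* Keep calling InsertUndoRecord() until the whole record is written. */
	while (!InsertUndoRecord(uur, page, starting_byte, &already_written, false))
	{
		/*
		 * The record did not fit on this page; continue on the next page,
		 * immediately after its block header.
		 */
		page = get_next_undo_page();	/* hypothetical helper */
		starting_byte = UndoLogBlockHeaderSize;
	}
}
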
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..a2bf7cc
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * the undo log only stores the mapping for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+					TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting the
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for given blkno and offset.  Start the search
+ * from urp.  The caller must call UndoRecordRelease to release the resources
+ * allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If the size is > MAX_PREPARED_UNDO,
+ * then extra memory is allocated to hold the additional prepared records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
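
To make the UndoFetchRecord() contract above concrete, here is a hedged sketch
of a caller and callback; the matching rules in example_undo_match() are an
assumption about how an access method might use this API and are not part of
the patch:

/*
 * Hypothetical callback: accept only records for the expected TID that were
 * written by the expected transaction.
 */
static bool
example_undo_match(UnpackedUndoRecord *urec, BlockNumber blkno,
				   OffsetNumber offset, TransactionId xid)
{
	return urec->uur_block == blkno &&
		urec->uur_offset == offset &&
		urec->uur_xid == xid;
}

/*
 * Hypothetical caller walking the undo chain starting at urp.
 */
static void
example_walk_chain(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
				   TransactionId xid)
{
	UndoRecPtr	matched_urp;
	UnpackedUndoRecord *uur;

	uur = UndoFetchRecord(urp, blkno, offset, xid, &matched_urp,
						  example_undo_match);
	if (uur == NULL)
		return;					/* the undo has already been discarded */

	/* ... examine uur->uur_payload / uur->uur_tuple here ... */

	UndoRecordRelease(uur);
}
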
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..85642ad
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,216 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without alignment padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_relfilenode;		/* relfilenode for relation */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	Oid			urec_tsid;		/* tablespace OID */
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It also stores the location of the next transaction's first undo record.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_relfilenode;	/* relfilenode for relation */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	Oid			uur_tsid;		/* tablespace OID */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id*/
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In the
+	 * future it can also be used to show how much of the undo has been applied
+	 * so far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
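
As a hedged illustration of the comments above, this sketch fills in an
UnpackedUndoRecord and computes its on-disk size; the record type code and the
field values are invented for the example and are not part of the patch:

/*
 * Hypothetical helper showing how an UnpackedUndoRecord might be filled in
 * before insertion.  The values used here are illustrative only.
 */
static Size
example_record_size(Oid relfilenode, BlockNumber blkno, OffsetNumber offset)
{
	UnpackedUndoRecord uur;

	memset(&uur, 0, sizeof(uur));
	uur.uur_type = 1;					/* hypothetical per-AM opcode */
	uur.uur_info = 0;					/* set by UndoRecordExpectedSize */
	uur.uur_relfilenode = relfilenode;
	uur.uur_xid = GetTopTransactionId();
	uur.uur_cid = GetCurrentCommandId(true);
	uur.uur_tsid = DEFAULTTABLESPACE_OID;
	uur.uur_fork = MAIN_FORKNUM;
	uur.uur_blkprev = InvalidUndoRecPtr;
	uur.uur_block = blkno;
	uur.uur_offset = offset;

	/* No payload or tuple bytes in this example, so those lengths stay 0. */
	return UndoRecordExpectedSize(&uur);
}
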
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#18)
2 attachment(s)
Re: Undo logs

On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Updated patch (merged latest code from the zheap main branch [1]https://github.com/EnterpriseDB/zheap).
The main change is removing the relfilenode and tablespace id
from the undo record and storing the reloid instead.
Earlier, we kept them because we expected to perform rollback without
a database connection, but that's no longer the case.

[1]: https://github.com/EnterpriseDB/zheap

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v7.patchapplication/octet-stream; name=0003-undo-interface-v7.patchDownload
From fa1102c595a13249ffc0f04d16a8a4d03112e914 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 01:17:03 -0800
Subject: [PATCH 1/3] undo-interface-v3

Provide an interface to prepare, insert, or fetch undo
records. This layer will use undo-log-storage to reserve space for
the undo records and the buffer management routines to write and read
the undo records.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
---
 src/backend/access/transam/xact.c    |   24 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1172 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  455 +++++++++++++
 src/include/access/undoinsert.h      |  106 +++
 src/include/access/undorecord.h      |  214 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2005 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..6b7f7fa 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -189,6 +189,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +916,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for the first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..36c161e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	 * Xid can be on either side of ckptXid when near wrap-around.  If xid is
+	 * numerically less than ckptXid but logically follows it, it must have
+	 * wrapped into the next epoch.  OTOH, if it is numerically greater but
+	 * logically precedes it, then it belongs to the previous epoch.
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..0fcdf98
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1172 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored in sequential order in the undo log.  Each
+ *  transaction's first undo record, a.k.a. the transaction header, points to
+ *  the next transaction's start header.  Transaction headers are linked so
+ *  that the discard worker can read the undo log transaction by transaction
+ *  and avoid reading each undo record.
+ *
+ * Handling multiple logs:
+ *
+ *  It is possible that the undo records of a transaction are spread across
+ *  multiple undo logs, and we need some special handling while inserting the
+ *  undo so that discard and rollback work sanely.
+ *
+ *  If an undo record goes to the next log, then we insert a transaction header
+ *  for the first record in the new log and update the previous transaction
+ *  header with this new log's location.  This allows us to connect transactions
+ *  across logs when the same transaction spans multiple logs (for this we keep
+ *  track of the previous logno in the undo log metadata), which is required to
+ *  find the latest undo record pointer of an aborted transaction for executing
+ *  the undo actions before discard.  If the next log gets processed first, we
+ *  don't need to trace back the actual start pointer of the transaction; in
+ *  that case we can execute the undo actions only from the current log, because
+ *  the undo pointer in the slot will be rewound and that is enough to avoid
+ *  executing the same actions again.  However, there is a possibility that
+ *  after executing the undo actions the undo pointer gets discarded; at a later
+ *  stage, while processing the previous log, we might try to fetch an undo
+ *  record in the discarded log while chasing the transaction header chain.  To
+ *  avoid this situation we first check whether the next_urec of the transaction
+ *  is already discarded; if so, there is no need to access it and we start
+ *  executing from the last undo record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads to the next
+ *  log; otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size of more than BLCKSZ?  If not,
+ * then an undo record can spread across 2 buffers at most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer points to the static memory.
+ * In case caller wants to support more than default max_prepared undo records
+ * then the limit can be increased by calling UndoSetPrepareSize function.
+ * Therein, dynamic memory will be allocated and prepared_undo and undo_buffer
+ * will start pointing to newly allocated memory, which will be released by
+ * UnlockReleaseUndoBuffers and these variables will again set back to their
+ * default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+	UndoRecPtr	prev_urecptr; /* prev txn's starting urecptr */
+	int			prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;	/* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int InsertFindBufferSlot(RelFileNode rnode, BlockNumber blk,
+								ReadBufferMode rbm,
+								UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl *log,
+							  UndoRecPtr prev_xact_urp);
+
+/*
+ * Check whether the undo record is discarded or not.  If it's already
+ * discarded, return false; otherwise return true.
+ *
+ * The caller must hold log->discard_lock.  This function will release the
+ * lock if it returns false; otherwise the lock is still held on return and
+ * the caller must release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl *log, UndoRecPtr prev_xact_urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker attempts to
+		 * discard undo logs for the first time, so we cannot rely on this
+		 * value to identify whether the undo record pointer is already
+		 * discarded; instead we check it by calling the undo log routine.  If
+		 * it's not yet discarded, then we have to reacquire log->discard_lock
+		 * so that it doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(prev_xact_urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (prev_xact_urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	prev_xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		prev_xact_urp = InvalidUndoRecPtr;
+	else
+		prev_xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case we have nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(prev_xact_urp))
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, prev_xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(prev_xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(prev_xact_urp);
+
+	while (true)
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk,
+									  RBM_NORMAL,
+									  log->meta.persistence);
+		prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+		index++;
+
+		if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	prev_txn_info.uur.uur_next = urecptr;
+	prev_txn_info.prev_urecptr = prev_xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just writes out the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside the critical section,
+ * and it only overwrites the undo header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(prev_txn_info.prev_urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	prev_urp = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	prev_urp = prev_txn_info.prev_urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, prev_urp))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(prev_urp);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = prev_txn_info.prev_txn_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&prev_txn_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present then just
+ * return its index, otherwise read the buffer, insert an entry into the array,
+ * and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+InsertFindBufferSlot(RelFileNode rnode,
+					 BlockNumber blk,
+					 ReadBufferMode rbm,
+					 UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to compare just the block number, because
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare both logno and blkno.
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert an entry into
+	 * the undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords undo records and allocate
+ * space for them in bulk.  This is needed for operations that can write
+ * multiple undo records under a single WAL record, e.g. multi-insert.  If we
+ * don't allocate undo space for all the records covered by one WAL record
+ * together, they might end up in different undo logs, and during recovery we
+ * currently have no mechanism to map an xid to multiple log numbers for one
+ * WAL operation.  In short, all the undo written under one WAL record must be
+ * allocated from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/*
+	 * If this is the first undo record for this transaction then set uur_next
+	 * to SpecialUndoRecPtr.  This indicates that space must be allocated for
+	 * the transaction header; the real value of uur_next will be filled in
+	 * while preparing the first undo record of the next transaction.
+	 */
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = SpecialUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, fetch the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these fields here, as they are only
+			 * used with the first record of a transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+
+		/* calculate the size of the undo record. */
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log but not the first record of
+	 * the transaction, i.e. the same transaction continued from the previous
+	 * log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If the transaction id has changed then update the previous
+	 * transaction's start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before the prepared undo is inserted.  If max_prepare is
+ * greater than MAX_PREPARED_UNDO then extra memory is allocated to hold the
+ * additional prepared records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Also account for the buffers needed to update the previous
+	 * transaction's starting undo record; hence the extra 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id, because
+ * the undo log only stores mappings for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX alternatively, instead of adding the information to this record we
+	 * could prepare a separate record that contains only the transaction
+	 * information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery we expect to always have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Use the top transaction id, because the undo log only stores
+		 * mappings for top-level transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = InsertFindBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* The undo record does not fit into this block, so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record pointer as well as a reference to the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
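+/*
+ * Register all undo buffers used by the prepared undo records with the WAL
+ * record that is currently being assembled.  Buffers for pages that will be
+ * initialized from scratch are registered with REGBUF_WILL_INIT, so no
+ * full-page image is needed for them.
+ */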
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
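+/*
+ * Stamp every undo buffer used by the prepared undo records with the LSN of
+ * the WAL record that covers them.
+ */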
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in a page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the block
+			 * header in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(prev_txn_info.prev_urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting the
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	prev_txn_info.prev_urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepare_undo limit was changed, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  If the complete record is fetched from a
+ * single buffer, the pin on that buffer is not released, so the caller can
+ * reuse the same urec to fetch another undo record from the same block.  The
+ * caller is responsible for releasing the buffer inside urec and setting it
+ * to invalid before fetching a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a pinned buffer from before, no need to read a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch just the header first, and
+		 * fetch the complete record only if the block number and offset
+		 * match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and set the buffer in the undo record to invalid.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for the given blkno, offset and transaction id
+ * (if valid).  We need to match the transaction id along with the block
+ * number and offset because in some cases (like reuse of a slot by a
+ * committed transaction), we need to skip a record if it was modified by a
+ * transaction later than the transaction indicated by the previous undo
+ * record.  For example, consider a case where tuple (ctid - 0,1) is modified
+ * by transaction id 500 which belongs to transaction slot 0.  Then, the same
+ * tuple is modified by transaction id 501 which belongs to transaction slot
+ * 1.  Then, both transaction slots are marked for reuse.  Then, the same
+ * tuple is modified again by transaction id 502 which has used slot 0.  Now,
+ * a transaction that started before transaction 500 and wants to traverse
+ * the chain to find a visible tuple would keep rotating infinitely between
+ * the undo tuples written by 502 and 501.  In such a case, we need to skip
+ * the undo tuple written by transaction 502 when we want to find the undo
+ * record indicated by the previous pointer of the undo tuple written by
+ * transaction 501.
+ *
+ * The search starts from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * *urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record, if a valid pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned, keep it only if the next record
+		 * we want is in the same block.  Otherwise release the buffer and set
+		 * it to invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized, so we have to check
+			 * UndoLogIsDiscarded; if the record is already discarded then we
+			 * have nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undo record satisfies the conditions. */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..51a4f7f
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,455 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	/* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+	/* if (uur->uur_info == 0) */
+		UndoRecordSetInfo(uur);
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	if (uur->uur_info == 0)
+		UndoRecordSetInfo(uur);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX A possible optimization: instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload or
+		 * tuple data falls entirely within the current page and avoid the
+		 * allocation for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number of
+ * bytes to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true then the readlen bytes are skipped over in the source
+ * buffer but are not copied to the destination.
+ *
+ * The return value is false if we ran out of bytes before reading all of
+ * them, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	/* Return true only if we read the whole thing. */
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+static void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..a2bf7cc
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,106 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
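+
+/*
+ * A minimal example callback (hypothetical, not part of this patch) might
+ * accept the first record that touches the given block and offset, and the
+ * given xid if one is supplied:
+ *
+ *	static bool
+ *	undo_record_matches(UnpackedUndoRecord *urec, BlockNumber blkno,
+ *						OffsetNumber offset, TransactionId xid)
+ *	{
+ *		return urec->uur_block == blkno &&
+ *			   urec->uur_offset == offset &&
+ *			   (!TransactionIdIsValid(xid) || urec->uur_xid == xid);
+ *	}
+ */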
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id, because
+ * the undo log only stores mappings for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+					TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting the
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
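+
+/*
+ * Illustrative calling sequence (a sketch only; it mirrors the test_undo_api
+ * module, and the record, persistence level and xid are the caller's):
+ *
+ *		urp = PrepareUndoInsert(&undorecord, UNDO_PERMANENT, xid);
+ *		START_CRIT_SECTION();
+ *		InsertPreparedUndo();
+ *		END_CRIT_SECTION();
+ *		UnlockReleaseUndoBuffers();
+ *
+ * Persistent (WAL-logged) callers are also expected to register the undo
+ * buffers with RegisterUndoLogBuffers() while building their WAL record and
+ * to stamp them with UndoLogBuffersSetLSN() after XLogInsert().
+ */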
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for the given blkno and offset.  The search
+ * starts from urp.  The caller needs to call UndoRecordRelease to release
+ * the resources allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before the prepared undo is inserted.  If max_prepare is
+ * greater than MAX_PREPARED_UNDO then extra memory is allocated to hold the
+ * additional prepared records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..005daa9
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,214 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without alignment padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid		urec_reloid;		/* relation OID */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * It will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;	/* relation OID */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer into which the undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In the
+	 * future it could also be used to show how much undo has been applied so
+	 * far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
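+
+/*
+ * A sketch of the insertion loop described above (how the caller obtains the
+ * next pinned and locked undo buffer is omitted here):
+ *
+ *	int already_written = 0;
+ *
+ *	while (!InsertUndoRecord(uur, BufferGetPage(buf), starting_byte,
+ *							 &already_written, false))
+ *	{
+ *		MarkBufferDirty(buf);
+ *		buf = ...next undo buffer...;
+ *		starting_byte = SizeOfPageHeaderData;
+ *	}
+ *	MarkBufferDirty(buf);
+ */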
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

0004-undo-interface-test-v7.patch (application/octet-stream)
From 10bb185f50b760da3a1ddeb6d32577ffa6206271 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Thu, 15 Nov 2018 02:18:48 -0800
Subject: [PATCH] undo-interface-test-v6

Provides test module for undo-interface routines.
---
 src/test/modules/Makefile                          |  1 +
 src/test/modules/test_undo_api/Makefile            | 21 ++++++
 .../test_undo_api/expected/test_undo_api.out       | 12 +++
 .../modules/test_undo_api/sql/test_undo_api.sql    |  8 ++
 .../modules/test_undo_api/test_undo_api--1.0.sql   |  8 ++
 src/test/modules/test_undo_api/test_undo_api.c     | 86 ++++++++++++++++++++++
 .../modules/test_undo_api/test_undo_api.control    |  4 +
 7 files changed, 140 insertions(+)
 create mode 100644 src/test/modules/test_undo_api/Makefile
 create mode 100644 src/test/modules/test_undo_api/expected/test_undo_api.out
 create mode 100644 src/test/modules/test_undo_api/sql/test_undo_api.sql
 create mode 100644 src/test/modules/test_undo_api/test_undo_api--1.0.sql
 create mode 100644 src/test/modules/test_undo_api/test_undo_api.c
 create mode 100644 src/test/modules/test_undo_api/test_undo_api.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 43323a6..e05fd00 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -19,6 +19,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_undo \
+		  test_undo_api \
 		  worker_spi
 
 $(recurse)
diff --git a/src/test/modules/test_undo_api/Makefile b/src/test/modules/test_undo_api/Makefile
new file mode 100644
index 0000000..deb3816
--- /dev/null
+++ b/src/test/modules/test_undo_api/Makefile
@@ -0,0 +1,21 @@
+# src/test/modules/test_undo_api/Makefile
+
+MODULE_big = test_undo_api
+OBJS = test_undo_api.o
+PGFILEDESC = "test_undo_api - a test module for the undo api layer"
+
+EXTENSION = test_undo_api
+DATA = test_undo_api--1.0.sql
+
+REGRESS = test_undo_api
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_undo_api
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_undo_api/expected/test_undo_api.out b/src/test/modules/test_undo_api/expected/test_undo_api.out
new file mode 100644
index 0000000..995b517
--- /dev/null
+++ b/src/test/modules/test_undo_api/expected/test_undo_api.out
@@ -0,0 +1,12 @@
+CREATE EXTENSION test_undo_api;
+--
+-- This test inserts data into the undo log using the undo api, then fetches
+-- it back and verifies that we get the same data.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
+ test_undo_api 
+---------------
+ 
+(1 row)
+
diff --git a/src/test/modules/test_undo_api/sql/test_undo_api.sql b/src/test/modules/test_undo_api/sql/test_undo_api.sql
new file mode 100644
index 0000000..4fb40ff
--- /dev/null
+++ b/src/test/modules/test_undo_api/sql/test_undo_api.sql
@@ -0,0 +1,8 @@
+CREATE EXTENSION test_undo_api;
+
+--
+-- This test inserts data into the undo log using the undo api, then fetches
+-- it back and verifies that we get the same data.
+--
+SELECT test_undo_api(txid_current()::text::xid, 'permanent');
diff --git a/src/test/modules/test_undo_api/test_undo_api--1.0.sql b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
new file mode 100644
index 0000000..3dd134b
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api--1.0.sql
@@ -0,0 +1,8 @@
+\echo Use "CREATE EXTENSION test_undo_api" to load this file. \quit
+
+CREATE FUNCTION test_undo_api(xid xid, persistence text)
+RETURNS void
+AS 'MODULE_PATHNAME'
+LANGUAGE C;
+
+
diff --git a/src/test/modules/test_undo_api/test_undo_api.c b/src/test/modules/test_undo_api/test_undo_api.c
new file mode 100644
index 0000000..a67eddc
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.c
@@ -0,0 +1,86 @@
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_class.h"
+#include "fmgr.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "storage/bufmgr.h"
+#include "utils/builtins.h"
+
+#include <stdlib.h>
+#include <unistd.h>
+
+PG_MODULE_MAGIC;
+
+PG_FUNCTION_INFO_V1(test_undo_api);
+
+static UndoPersistence
+undo_persistence_from_text(text *t)
+{
+	char *str = text_to_cstring(t);
+
+	if (strcmp(str, "permanent") == 0)
+		return UNDO_PERMANENT;
+	else if (strcmp(str, "temporary") == 0)
+		return UNDO_TEMP;
+	else if (strcmp(str, "unlogged") == 0)
+		return UNDO_UNLOGGED;
+	else
+		elog(ERROR, "unknown undo persistence level: %s", str);
+}
+
+/*
+ * Prepare and insert data in undo storage and fetch it back to verify.
+ */
+Datum
+test_undo_api(PG_FUNCTION_ARGS)
+{
+	TransactionId xid = DatumGetTransactionId(PG_GETARG_DATUM(0));
+	UndoPersistence persistence = undo_persistence_from_text(PG_GETARG_TEXT_PP(1));
+	char	*data = "test_data";
+	int		 len = strlen(data);
+	UnpackedUndoRecord	undorecord;
+	UnpackedUndoRecord *undorecord_out;
+	int	header_size = offsetof(UnpackedUndoRecord, uur_next) + sizeof(uint64);
+	UndoRecPtr	undo_ptr;
+
+	undorecord.uur_type = 0;
+	undorecord.uur_info = 0;
+	undorecord.uur_prevlen = 0;
+	undorecord.uur_prevxid = FrozenTransactionId;
+	undorecord.uur_xid = xid;
+	undorecord.uur_cid = 0;
+	undorecord.uur_fork = MAIN_FORKNUM;
+	undorecord.uur_blkprev = 0;
+	undorecord.uur_block = 1;
+	undorecord.uur_offset = 100;
+	undorecord.uur_payload.len = 0;
+	initStringInfo(&undorecord.uur_tuple);
+
+	appendBinaryStringInfo(&undorecord.uur_tuple,
+						   (char *) data,
+						   len);
+	undo_ptr = PrepareUndoInsert(&undorecord, persistence, xid);
+	START_CRIT_SECTION();
+	InsertPreparedUndo();
+	END_CRIT_SECTION();
+	UnlockReleaseUndoBuffers();
+
+	undorecord_out = UndoFetchRecord(undo_ptr, InvalidBlockNumber,
+									 InvalidOffsetNumber,
+									 InvalidTransactionId, NULL,
+									 NULL);
+
+	if (strncmp((char *) &undorecord, (char *) undorecord_out, header_size) != 0)
+		elog(ERROR, "undo header did not match");
+	if (strncmp(undorecord_out->uur_tuple.data, data, len) != 0)
+		elog(ERROR, "undo data did not match");
+
+	UndoRecordRelease(undorecord_out);
+	pfree(undorecord.uur_tuple.data);
+
+	PG_RETURN_VOID();
+}
diff --git a/src/test/modules/test_undo_api/test_undo_api.control b/src/test/modules/test_undo_api/test_undo_api.control
new file mode 100644
index 0000000..09df344
--- /dev/null
+++ b/src/test/modules/test_undo_api/test_undo_api.control
@@ -0,0 +1,4 @@
+comment = 'test_undo_api'
+default_version = '1.0'
+module_pathname = '$libdir/test_undo_api'
+relocatable = true
-- 
1.8.3.1

#20Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#19)
Re: Undo logs

On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Updated patch (merged latest code from the zheap main branch [1]).

Review comments:
-------------------------------
1.
UndoRecordPrepareTransInfo()
{
..
+ /*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..
}

It is expected that the caller of UndoRecPtrIsValid should hold the
discard lock, but I don't see how the call from this place ensures
that.

2.
UndoRecordPrepareTransInfo()
{
..
/*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
+
+ /*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doesn't remove the record while we are in process of
+ * reading it.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+ if (!UndoRecordIsValid(log, prev_xact_urp))
+ return;
..
}

I don't understand this logic where you are checking the same
information with and without a lock; is there any reason for that? It
seems the first call to UndoRecPtrIsValid is not required.

3.
UndoRecordPrepareTransInfo()
{
..
+ while (true)
+ {
+ bufidx = InsertFindBufferSlot(rnode, cur_blk,
+   RBM_NORMAL,
+   log->meta.persistence);
+ prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+ buffer = undo_buffer[bufidx].buf;
+ page = BufferGetPage(buffer);
+ index++;
+
+ if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+ &already_decoded, true))
+ break;
+
+ starting_byte = UndoLogBlockHeaderSize;
+ cur_blk++;
+ }

Can you write some commentary on what this code is doing?

There is no need to use index++; as a separate statement, you can do
it while assigning the buffer in that index.
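
I.e. something like (sketch only):

prev_txn_info.prev_txn_undo_buffers[index++] = bufidx;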

4.
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+ UndoRecPtr prev_xact_urp;

I think you can simply name this variable xact_urp. All this and
related prev_* terminology used for variables seems confusing to me. I
understand that you are trying to update the last transaction's undo
record information, but you can explain that via comments. Keeping
such information as part of variable names not only makes them longer,
but is also confusing.

5.
/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+ UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */
+ int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+ UnpackedUndoRecord uur; /* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;

Due to the reasons mentioned in point 4, let's name the structure and its
variables as below:

typedef struct XactUndoRecordInfo
{
UndoRecPtr start_urecptr; /* prev txn's starting urecptr */
int idx_undo_buffers[MAX_BUFFER_PER_UNDO];
UnpackedUndoRecord first_uur; /* prev txn's first undo record. */
} XactUndoRecordInfo;

static XactUndoRecordInfo xact_ur_info;

6.
+static int
+InsertFindBufferSlot(RelFileNode rnode,

The name of this function is not clear, can we change it to
UndoGetBufferSlot or UndoGetBuffer?

7.
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+ UndoPersistence upersistence, TransactionId txid)
{
..
+ /*
+ * If this is the first undo record for this transaction then set the
+ * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+ * the space for the transaction header and the valid value of the uur_next
+ * will be updated while preparing the first undo record of the next
+ * transaction.
+ */
+ first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
..
}
I think it will be better if we move this comment a few lines down:
+ if (need_start_undo && i == 0)
+ {
+ urec->uur_next = SpecialUndoRecPtr;

BTW, is the only reason to set a special value (SpecialUndoRecPtr) for
uur_next to allocate the transaction header? If so, can't we
directly set the corresponding flag (UREC_INFO_TRANSACTION) in
uur_info and then remove it from UndoRecordSetInfo?
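
To make that concrete, roughly something like this (just a sketch, using
the field and flag names from your patch):

if (need_start_undo && i == 0)
{
    ...
    /* ask for space for the transaction header directly */
    urec->uur_info |= UREC_INFO_TRANSACTION;
}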

I think it would have been better if there is one central location to
set uur_info, but as that is becoming tricky,
we should not try to add some special stuff to make it possible.

8.
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+ Size size;
+
+ /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+ /* if (uur->uur_info == 0) */
+ UndoRecordSetInfo(uur);

Can't we move UndoRecordSetInfo to its caller
(UndoRecordAllocateMulti)? It seems the other caller of this function
doesn't expect this. If we do it that way, then we can have an Assert
for non-zero uur_info in UndoRecordExpectedSize.
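
That is, something like this (sketch only):

/* in UndoRecordAllocateMulti, while computing the size */
UndoRecordSetInfo(urec);
size += UndoRecordExpectedSize(urec);

/* and in UndoRecordExpectedSize */
Assert(uur->uur_info != 0);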

9.
bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)
+{
+ char   *writeptr = (char *) page + starting_byte;
+ char   *endptr = (char *) page + BLCKSZ;
+ int my_bytes_written = *already_written;
+
+ if (uur->uur_info == 0)
+ UndoRecordSetInfo(uur);

Do we really need UndoRecordSetInfo here? If not, then just add an
assert for non-zero uur_info?

10
UndoRecordAllocateMulti()
{
..
else
+ {
+ /*
+ * It is okay to initialize these variables as these are used only
+ * with the first record of transaction.
+ */
+ urec->uur_next = InvalidUndoRecPtr;
+ urec->uur_xidepoch = 0;
+ urec->uur_dbid = 0;
+ urec->uur_progress = 0;
+ }
+
+
+ /* calculate the size of the undo record. */
+ size += UndoRecordExpectedSize(urec);
+ }

Remove one extra line before comment "calculate the size of ..".

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#21Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#20)
1 attachment(s)
Re: Undo logs

On Sat, Nov 17, 2018 at 5:12 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Fri, Nov 16, 2018 at 9:46 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Thu, Nov 15, 2018 at 12:14 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Updated patch (merged latest code from the zheap main branch [1]).

Review comments:
-------------------------------
1.
UndoRecordPrepareTransInfo()
{
..
+ /*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
..
}

It is expected that the caller of UndoRecPtrIsValid should hold the
discard lock, but I don't see how the call from this place ensures
that.

I think it's duplicate code; I made a mistake while merging from the zheap branch.

2.
UndoRecordPrepareTransInfo()
{
..
/*
+ * The absence of previous transaction's undo indicate that this backend
+ * is preparing its first undo in which case we have nothing to update.
+ */
+ if (!UndoRecPtrIsValid(prev_xact_urp))
+ return;
+
+ /*
+ * Acquire the discard lock before accessing the undo record so that
+ * discard worker doesn't remove the record while we are in process of
+ * reading it.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+ if (!UndoRecordIsValid(log, prev_xact_urp))
+ return;
..
}

I don't understand this logic where you are checking the same
information with and without a lock; is there any reason for that? It
seems the first call to UndoRecPtrIsValid is not required.

Removed

3.
UndoRecordPrepareTransInfo()
{
..
+ while (true)
+ {
+ bufidx = InsertFindBufferSlot(rnode, cur_blk,
+   RBM_NORMAL,
+   log->meta.persistence);
+ prev_txn_info.prev_txn_undo_buffers[index] = bufidx;
+ buffer = undo_buffer[bufidx].buf;
+ page = BufferGetPage(buffer);
+ index++;
+
+ if (UnpackUndoRecord(&prev_txn_info.uur, page, starting_byte,
+ &already_decoded, true))
+ break;
+
+ starting_byte = UndoLogBlockHeaderSize;
+ cur_blk++;
+ }

Can you write some commentary on what this code is doing?

Done

There is no need to use index++; as a separate statement, you can do
it while assigning the buffer in that index.

Done

4.
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+ UndoRecPtr prev_xact_urp;

I think you can simply name this variable xact_urp. All this and
related prev_* terminology used for variables seems confusing to me. I
understand that you are trying to update the last transaction's undo
record information, but you can explain that via comments. Keeping
such information as part of variable names not only makes them longer,
but is also confusing.

5.
/*
+ * Structure to hold the previous transaction's undo update information.
+ */
+typedef struct PreviousTxnUndoRecord
+{
+ UndoRecPtr prev_urecptr; /* prev txn's starting urecptr */
+ int prev_txn_undo_buffers[MAX_BUFFER_PER_UNDO];
+ UnpackedUndoRecord uur; /* prev txn's first undo record. */
+} PreviousTxnInfo;
+
+static PreviousTxnInfo prev_txn_info;

Due to the reasons mentioned in point 4, let's name the structure and its
variables as below:

typedef struct XactUndoRecordInfo
{
UndoRecPtr start_urecptr; /* prev txn's starting urecptr */
int idx_undo_buffers[MAX_BUFFER_PER_UNDO];
UnpackedUndoRecord first_uur; /* prev txn's first undo record. */
} XactUndoRecordInfo;

static XactUndoRecordInfo xact_ur_info;

Done, but I have kept start_urecptr as urecptr and first_uur as uur,
and explained that in a comment.
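
For reference, the structure in the attached patch now looks like this:

typedef struct XactUndoRecordInfo
{
    UndoRecPtr  urecptr;        /* txn's start urecptr */
    int         idx_undo_buffers[MAX_BUFFER_PER_UNDO];
    UnpackedUndoRecord uur;     /* undo record header */
} XactUndoRecordInfo;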

6.
+static int
+InsertFindBufferSlot(RelFileNode rnode,

The name of this function is not clear, can we change it to
UndoGetBufferSlot or UndoGetBuffer?

Changed to UndoGetBufferSlot

7.
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+ UndoPersistence upersistence, TransactionId txid)
{
..
+ /*
+ * If this is the first undo record for this transaction then set the
+ * uur_next to the SpecialUndoRecPtr.  This is the indication to allocate
+ * the space for the transaction header and the valid value of the uur_next
+ * will be updated while preparing the first undo record of the next
+ * transaction.
+ */
+ first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
..
}

Done

I think it will be better if we move this comment a few lines down:
+ if (need_start_undo && i == 0)
+ {
+ urec->uur_next = SpecialUndoRecPtr;

BTW, is the only reason to set a special value (SpecialUndoRecPtr) for
uur_next to allocate the transaction header? If so, can't we
directly set the corresponding flag (UREC_INFO_TRANSACTION) in
uur_info and then remove it from UndoRecordSetInfo?

yeah, Done that way.

I think it would have been better if there is one central location to
set uur_info, but as that is becoming tricky,
we should not try to add some special stuff to make it possible.

8.
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+ Size size;
+
+ /* Fixme : Temporary hack to allow zheap to set some value for uur_info. */
+ /* if (uur->uur_info == 0) */
+ UndoRecordSetInfo(uur);

Can't we move UndoRecordSetInfo to its caller
(UndoRecordAllocateMulti)? It seems the other caller of this function
doesn't expect this. If we do it that way, then we can have an Assert
for non-zero uur_info in UndoRecordExpectedSize.

Done that way

9.
bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)
+{
+ char   *writeptr = (char *) page + starting_byte;
+ char   *endptr = (char *) page + BLCKSZ;
+ int my_bytes_written = *already_written;
+
+ if (uur->uur_info == 0)
+ UndoRecordSetInfo(uur);

Do we really need UndoRecordSetInfo here? If not, then just add an
assert for non-zero uur_info?

Done

10
UndoRecordAllocateMulti()
{
..
else
+ {
+ /*
+ * It is okay to initialize these variables as these are used only
+ * with the first record of transaction.
+ */
+ urec->uur_next = InvalidUndoRecPtr;
+ urec->uur_xidepoch = 0;
+ urec->uur_dbid = 0;
+ urec->uur_progress = 0;
+ }
+
+
+ /* calculate the size of the undo record. */
+ size += UndoRecordExpectedSize(urec);
+ }

Remove one extra line before comment "calculate the size of ..".

Fixed

Along with that, I have merged the latest changes in the zheap branch
committed by Rafia Sabih for cleaning up the undo buffer information in
the abort path.
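
For reference, that cleanup just hooks into the abort path, as in the
attached patch:

/* in AbortTransaction() */
AtAbort_ResetUndoBuffers();   /* a thin wrapper over ResetUndoBuffers() */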

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v8.patchapplication/octet-stream; name=0003-undo-interface-v8.patchDownload
From 4e5b00bbc881f8f3e0ac75e77c45ccde5c5c4b02 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilip.kumar@enterprisedb.com>
Date: Wed, 14 Nov 2018 01:17:03 -0800
Subject: [PATCH] undo-interface-v8

Provide an interface to prepare, insert, and fetch undo records.  This
layer will use the undo-log-storage layer to reserve space for the undo
records and the buffer management routines to write and read the undo
records.
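
A typical caller follows this sequence (as exercised by the test_undo_api
module; shown here only as a sketch):

    undo_ptr = PrepareUndoInsert(&undorecord, persistence, xid);
    START_CRIT_SECTION();
    InsertPreparedUndo();
    END_CRIT_SECTION();
    UnlockReleaseUndoBuffers();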

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
---
 src/backend/access/transam/xact.c    |   27 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1196 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  449 +++++++++++++
 src/include/access/undoinsert.h      |  108 +++
 src/include/access/undorecord.h      |  219 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2033 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..337442f 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2627,6 +2653,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..36c161e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..78aee0a
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1196 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored in sequential order in the undo log.  Each
+ *  transaction's first undo record (a.k.a. the transaction header) points to
+ *  the next transaction's start header.  Transaction headers are linked so
+ *  that the discard worker can read the undo log transaction by transaction
+ *  and avoid reading each undo record.
+ *
+ * Handling multiple logs:
+ *
+ *  A transaction's undo records can be spread across multiple undo logs,
+ *  and we need some special handling while inserting the undo so that
+ *  discard and rollback work sanely.
+ *
+ *  If an undo record goes to the next log then we insert a transaction
+ *  header for the first record in the new log and update the transaction
+ *  header with this new log's location.  This allows us to connect
+ *  transactions across logs when the same transaction spans more than one
+ *  log (for this we keep track of the previous logno in the undo log
+ *  meta-data), which is required to find the latest undo record pointer of
+ *  an aborted transaction for executing the undo actions before discard.
+ *  If the next log gets processed first, we don't need to trace back the
+ *  actual start pointer of the transaction; in that case we can execute the
+ *  undo actions from the current log only, because the undo pointer in the
+ *  slot will be rewound and that is enough to avoid executing the same
+ *  actions again.  However, it is possible that after executing the undo
+ *  actions the undo pointer gets discarded, and at a later stage, while
+ *  processing the previous log, we might try to fetch an undo record in the
+ *  discarded log while chasing the transaction header chain.  To avoid this
+ *  situation we first check whether the next_urec of the transaction is
+ *  already discarded; if so, there is no need to access it and we start
+ *  executing from the last undo record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads to it.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size of more than BLCKSZ?  If
+ * not, then an undo record can spread across 2 buffers at most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the transaction's start undo record.
+ */
+static TransactionId	prev_txid[UndoPersistenceLevels] = { 0 };
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber	logno;			/* Undo log number */
+	BlockNumber		blk;			/* block number */
+	Buffer			buf;			/* buffer allocated for the block */
+	bool			zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr urp;						/* undo record pointer */
+	UnpackedUndoRecord *urec;			/* undo record */
+	int undo_buffer_idx[MAX_BUFFER_PER_UNDO]; /* undo_buffer array index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
+static int prepare_idx;
+static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
+static bool	update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.
+ * In case the caller wants to support more than the default max_prepared undo
+ * records then the limit can be increased by calling UndoSetPrepareSize.
+ * In that case, dynamic memory will be allocated and prepared_undo and
+ * undo_buffer will start pointing to the newly allocated memory, which will
+ * be released by UnlockReleaseUndoBuffers, and these variables will be set
+ * back to their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating its undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord* UndoGetOneRecord(UnpackedUndoRecord *urec,
+											UndoRecPtr urp, RelFileNode rnode,
+											UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+											 bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+							 ReadBufferMode rbm,
+							 UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl *log,
+							  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record is discarded or not.  If it is already
+ * discarded, return false; otherwise return true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock
+ * if it returns false; otherwise the lock is still held on return and the
+ * caller needs to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl *log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the DiscardWorker attempts to
+		 * discard undo logs for the first time, so we cannot rely on this
+		 * value to decide whether the undo record pointer is already
+		 * discarded; instead we check it by calling the undo log routine.  If
+		 * it is not yet discarded then we have to reacquire log->discard_lock
+		 * so that it doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber	cur_blk;
+	RelFileNode	rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		xact_urp = InvalidUndoRecPtr;
+	else
+		xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * The absence of previous transaction's undo indicate that this backend
+	 * is preparing its first undo in which case we have nothing to update.
+	 * UndoRecordIsValid will release the lock if it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the undo
+	 * record header is split across buffers then we need to read the complete
+	 * header by invoking UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info.idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info.uur.uur_next = urecptr;
+	xact_urec_info.urecptr = xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info.urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info.urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transactions start urecptr in the transaction
+	 * header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer  buffer;
+		int		buf_idx;
+
+		buf_idx = xact_urec_info.idx_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while(true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in undo buffer array, if it's present then just return
+ * its index otherwise search the buffer and insert an entry and lock the buffer
+ * in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int 	i;
+	Buffer 	buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because the
+		 * undo_buffer might hold undo from different undo logs (e.g.
+		 * when the previous transaction's start header is in the previous
+		 * undo log), so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+										GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate the space in
+ * bulk.  This is required for operations that can allocate multiple undo
+ * records in one WAL operation, e.g. multi-insert.  If we don't allocate
+ * undo space for all the records (which are inserted under one WAL record)
+ * together, then there is a possibility that they go into different undo
+ * logs.  And, currently during recovery we don't have a mechanism to map an
+ * xid to multiple log numbers for one WAL operation.  So, in short, all the
+ * operations under one WAL record must allocate their undo from the same log.
+ */
+static UndoRecPtr
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+						UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	bool	need_start_undo = false;
+	bool	first_rec_in_recovery;
+	bool	log_switched = false;
+	int	i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+	if ((!InRecovery && prev_txid[upersistence] != txid) ||
+		first_rec_in_recovery)
+	{
+		need_start_undo = true;
+	}
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * If this is the first undo record of the transaction then initialize
+		 * the transaction header fields of the undorecord. Also, set the flag
+		 * in the uur_info to indicate that this record contains the transaction
+		 * header so allocate the space for the same.
+		 */
+		if (need_start_undo && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, Fetch database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables as these are used only
+			 * with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+		/*
+		 * Set uur_info for an UnpackedUndoRecord appropriately based on which
+		 * fields are set and calculate the size of the undo record based on the
+		 * uur_info.
+		 */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If this is the first record of the log and not the first record of
+	 * the transaction i.e. same transaction continued from the previous log
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, that will make the record larger,
+	 * so we'll have to go back and recompute the size.
+	 */
+	if (!need_start_undo &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_start_undo = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+
+		goto resize;
+	}
+
+	/*
+	 * If transaction id is switched then update the previous transaction's
+	 * start undo record.
+	 */
+	if (first_rec_in_recovery ||
+		(!InRecovery && prev_txid[upersistence] != txid) ||
+		log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before inserting them.  If size is > MAX_PREPARED_UNDO then it
+ * will allocate extra memory to hold the extra prepared undo records.
+ */
+void
+UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,
+											 upersistence, txid);
+	if (max_prepare <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(max_prepare * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's
+	 * starting undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
+						 sizeof(UndoBuffers));
+	max_prepare_undo = max_prepare;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because the
+ * undo log only stores mappings for top-level transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize	size;
+	UndoRecPtr		urecptr;
+	RelFileNode		rnode;
+	UndoRecordSize  cur_size = 0;
+	BlockNumber		cur_blk;
+	TransactionId	txid;
+	int				starting_byte;
+	int				index = 0;
+	int				bufidx;
+	ReadBufferMode	rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepare_undo)
+		return InvalidUndoRecPtr;
+
+	/*
+	 * If this is the first undo record for this top transaction add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX there is also an option that instead of adding the information to
+	 * this record, we can prepare a new record which only contains the
+	 * transaction information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* We expect that during recovery we always have a valid transaction id. */
+		Assert (!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because undo log only stores mapping for
+		 * the top most transactions.
+		 */
+		Assert (InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(multi_prep_urp))
+		urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+	else
+		urecptr = multi_prep_urp;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(multi_prep_urp))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* Undo record can not fit into this block so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int		idx;
+	int		flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int		idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page	page;
+	int		starting_byte;
+	int		already_written;
+	int		bufidx = 0;
+	int		idx;
+	uint16	undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord	*uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16	prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+				uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer  buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * succeed then call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while(true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(xact_urec_info.urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ *  Reset the global variables related to undo buffers.  This is required at
+ *  transaction abort or when releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	xact_urec_info.urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	multi_prep_urp = InvalidUndoRecPtr;
+
+	/*
+	 * max_prepare_undo limit is changed so free the allocated memory and reset
+	 * all the variable back to its default value.
+	 */
+	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepare_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int	i;
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * to by urp and unpack the record into urec.  This function will not release
+ * the pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record which is on
+ * the same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid to fetch a record from another block.
+ */
+static UnpackedUndoRecord*
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer			 buffer = urec->uur_buffer;
+	Page			 page;
+	int				 starting_byte = UndoRecPtrGetPageOffset(urp);
+	int				 already_decoded = 0;
+	BlockNumber		 cur_blk;
+	bool			 is_undo_splited = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to just fetch header first and only
+		 * if matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_splited = true;
+
+		/*
+		 * The complete record does not fit into one buffer, so release the
+		 * buffer pin and also set an invalid buffer in the undo record.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_splited)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord*
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode		 rnode, prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int	logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we already have a valid buffer pinned then check whether the
+		 * next record to fetch is from the same block.  Otherwise release
+		 * the buffer and set it to invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to a
+			 * different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is not a valid buffer in urec->uur_buffer that means we
+			 * had copied the payload data and tuple data so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to check
+			 * UndoLogIsDiscarded and if it's already discarded then we have
+			 * nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl	*prevlog, *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr (logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree (urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..578ebcb
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,449 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size	size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char   *writeptr = (char *) page + starting_byte;
+	char   *endptr = (char *) page + BLCKSZ;
+	int		my_bytes_written = *already_written;
+
+	Assert (uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption
+	 * that it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before,
+		 * or caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int		can_write;
+	int		remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing
+	 * to do except update *my_bytes_written, which we must do to ensure
+	 * that the next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+					  int *already_decoded, bool header_only)
+{
+	char	*readptr = (char *)page + starting_byte;
+	char	*endptr = (char *) page + BLCKSZ;
+	int		my_bytes_decoded = *already_decoded;
+	bool	is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							&readptr, endptr,
+							&my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise,
+		 * allocate memory for them.
+		 *
+		 * XXX There is a possible optimization here: instead of always
+		 * allocating memory whenever the record is split, we could check
+		 * whether the payload or tuple data falls entirely within one page
+		 * and, if so, avoid allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to be read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * 'nocopy': if this flag is true, we just skip over 'readlen' bytes of
+ * undo data without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int		can_read;
+	int		remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	/* Return true only if we read the whole thing. */
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..be8914c
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,108 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord* urec,
+											BlockNumber blkno,
+											OffsetNumber offset,
+											TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+					TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step performed after exiting any
+ * critical section.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for given blkno and offset.  Start the search
+ * from urp.  The caller needs to call UndoRecordRelease to release the
+ * resources allocated by this function.
+ */
+extern UnpackedUndoRecord* UndoFetchRecord(UndoRecPtr urp,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid,
+										   UndoRecPtr *urec_ptr_out,
+										   SatisfyUndoRecordCallback callback);
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the value of how many maximum prepared can
+ * be done before inserting the prepared undo.  If size is > MAX_PREPARED_UNDO
+ * then it will allocate extra memory to hold the extra prepared undo.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+							   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+/* Reset globals related to undo buffers */
+extern void ResetUndoBuffers(void);
+
+#endif   /* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..8ca3bda
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,219 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  The structures are packed together without any alignment
+ * padding, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid		urec_reloid;		/* relation OID */
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;			/* Transaction id */
+	CommandId	urec_cid;			/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber		urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * it will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32			urec_progress;  /* undo applying progress. */
+	uint32			urec_xidepoch;  /* epoch of the current transaction */
+	Oid				urec_dbid;		/* database id */
+	uint64			urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;		/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;	/* relation OID */
+	TransactionId uur_prevxid;		/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id*/
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In future
+	 * it could also be used to track how much of the undo has been applied
+	 * so far, but currently only 0 and 1 are used.
+	 */
+	uint32         uur_progress;
+	StringInfoData uur_payload;	/* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif   /* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#22Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#21)
1 attachment(s)
Re: Undo logs

On Tue, Nov 20, 2018 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Along with that I have merged latest changes in zheap branch committed
by Rafia Sabih for cleaning up the undo buffer information in abort
path.

Thanks, a few more comments:

1.
@@ -2627,6 +2653,7 @@ AbortTransaction(void)
AtEOXact_HashTables(false);
AtEOXact_PgStat(false);
AtEOXact_ApplyLauncher(false);
+ AtAbort_ResetUndoBuffers();

Don't we need similar handling in AbortSubTransaction?

2.
 Read undo record header in by calling UnpackUndoRecord, if the undo
+ * record header is splited across buffers then we need to read the complete
+ * header by invoking UnpackUndoRecord multiple times.

/splited/splitted. You can just use split here.

3.
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * it will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress;  /* undo applying progress. */
+ uint32 urec_xidepoch;  /* epoch of the current transaction */
+ Oid urec_dbid; /* database id */
+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;

/it will/It will.
BTW, which field(s) in the above structure stores the size of the undo?

4.
+ /*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * fields are set and calculate the size of the undo record based on the
+ * uur_info.
+ */

Can we rephrase it as "calculate the size of the undo record based on
the info required"?

5.
+/*
+ * Unlock and release undo buffers.  This step performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)

Can we change the later sentence as "This step is performed after
exiting any critical section where we have performed undo action."?

6.
+InsertUndoRecord()
{
..
+ Assert (uur->uur_info != 0);

Add a comment above Assert "The undo record must contain a valid information."

6.
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+ UndoPersistence upersistence, TransactionId txid)
{
..
+ first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+ if ((!InRecovery && prev_txid[upersistence] != txid) ||
+ first_rec_in_recovery)
+ {
+ need_start_undo = true;
+ }

Here, I think we can avoid using two boolean variables
(first_rec_in_recovery and need_start_undo). Also, this same check is
used in this function twice. I have tried to simplify it in the
attached. Can you check and let me know if that sounds okay to you?
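
For reference, the shape of the simplified check in the attached delta is
roughly this (see the delta patch for the exact context):

	/* Is this the first undo record of the transaction? */
	if ((InRecovery && IsTransactionFirstRec(txid)) ||
		(!InRecovery && prev_txid[upersistence] != txid))
		need_xact_hdr = true;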

7.
UndoRecordAllocateMulti
{
..
/*
+ * If this is the first undo record of the transaction then initialize
+ * the transaction header fields of the undorecord. Also, set the flag
+ * in the uur_info to indicate that this record contains the transaction
+ * header so allocate the space for the same.
+ */
+ if (need_start_undo && i == 0)
+ {
+ urec->uur_next = InvalidUndoRecPtr;
+ urec->uur_xidepoch = GetEpochForXid(txid);
+ urec->uur_progress = 0;
+
+ /* During recovery, Fetch database id from the undo log state. */
+ if (InRecovery)
+ urec->uur_dbid = UndoLogStateGetDatabaseId();
+ else
+ urec->uur_dbid = MyDatabaseId;
+
+ /* Set uur_info to include the transaction header. */
+ urec->uur_info |= UREC_INFO_TRANSACTION;
+ }
..
}

It seems here you have written the code in your comments. I have
changed it in the attached delta patch.

8.
UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+    TransactionId xid, UndoPersistence upersistence)
+{
..
..
+ multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,

Can we rename this variable as prepared_urec_ptr or prepared_urp?

9.
+void
+UndoSetPrepareSize(int max_prepare,

I think it will be better to use nrecords instead of 'max_prepare'
similar to how you have it in UndoRecordAllocateMulti()

10.
+ if (!UndoRecPtrIsValid(multi_prep_urp))
+ urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+ else
+ urecptr = multi_prep_urp;
+
+ size = UndoRecordExpectedSize(urec);
..
..
+ if (UndoRecPtrIsValid(multi_prep_urp))
+ {
+ UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+ insert = UndoLogOffsetPlusUsableBytes(insert, size);
+ multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+ }

Can't we use urecptr instead of multi_prep_urp in above code after
urecptr is initialized?

11.
+static int max_prepare_undo = MAX_PREPARED_UNDO;

Let's change the name of this variable as max_prepared_undo. Already
changed in attached delta patch

12.
PrepareUndoInsert()
{
..
+ /* Already reached maximum prepared limit. */
+ if (prepare_idx == max_prepare_undo)
+ return InvalidUndoRecPtr;
..
}

I think in the above condition, we should have elog, otherwise,
callers need to be prepared to handle it.
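
For example, something along these lines (just a sketch, the exact message
wording is up to you):

	/* Already reached maximum prepared limit. */
	if (prepare_idx == max_prepared_undo)
		elog(ERROR, "already reached the maximum prepared undo limit");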

13.
UndoRecordAllocateMulti()

How about naming it as UndoRecordAllocate as this is used to allocate
even a single undo record?

14.
If not already done, can you pgindent the new code added by this patch?

Attached is a delta patch on top of your previous patch containing
some fixes as mentioned above and a few other minor changes and cleanup.
If you find the changes okay, kindly include them in your next version.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v8-delta-amit.patch (application/octet-stream)
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
index 78aee0a0cc..fcdc70c639 100644
--- a/src/backend/access/undo/undoinsert.c
+++ b/src/backend/access/undo/undoinsert.c
@@ -113,7 +113,7 @@ typedef struct PreparedUndoSpace
 
 static PreparedUndoSpace  def_prepared[MAX_PREPARED_UNDO];
 static int prepare_idx;
-static int	max_prepare_undo = MAX_PREPARED_UNDO;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
 static UndoRecPtr	multi_prep_urp = InvalidUndoRecPtr;
 static bool	update_prev_header = false;
 
@@ -442,8 +442,7 @@ UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
 	UndoLogControl *log;
 	UndoRecordSize	size;
 	UndoRecPtr		urecptr;
-	bool	need_start_undo = false;
-	bool	first_rec_in_recovery;
+	bool	need_xact_hdr = false;
 	bool	log_switched = false;
 	int	i;
 
@@ -451,13 +450,10 @@ UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
 	if (nrecords <= 0)
 		elog(ERROR, "cannot allocate space for zero undo records");
 
-	first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
-
-	if ((!InRecovery && prev_txid[upersistence] != txid) ||
-		first_rec_in_recovery)
-	{
-		need_start_undo = true;
-	}
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
 
 resize:
 	size = 0;
@@ -467,18 +463,16 @@ resize:
 		urec = undorecords + i;
 
 		/*
-		 * If this is the first undo record of the transaction then initialize
-		 * the transaction header fields of the undorecord. Also, set the flag
-		 * in the uur_info to indicate that this record contains the transaction
-		 * header so allocate the space for the same.
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
 		 */
-		if (need_start_undo && i == 0)
+		if (need_xact_hdr && i == 0)
 		{
 			urec->uur_next = InvalidUndoRecPtr;
 			urec->uur_xidepoch = GetEpochForXid(txid);
 			urec->uur_progress = 0;
 
-			/* During recovery, Fetch database id from the undo log state. */
+			/* During recovery, get the database id from the undo log state. */
 			if (InRecovery)
 				urec->uur_dbid = UndoLogStateGetDatabaseId();
 			else
@@ -490,8 +484,8 @@ resize:
 		else
 		{
 			/*
-			 * It is okay to initialize these variables as these are used only
-			 * with the first record of transaction.
+			 * It is okay to initialize these variables with invalid values
+			 * as these are used only with the first record of transaction.
 			 */
 			urec->uur_next = InvalidUndoRecPtr;
 			urec->uur_xidepoch = 0;
@@ -514,11 +508,17 @@ resize:
 		urecptr = UndoLogAllocate(size, upersistence);
 
 	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in
+	 * recovery.
+	 */
 	Assert(AmAttachedToUndoLog(log) || InRecovery);
 
 	/*
-	 * If this is the first record of the log and not the first record of
-	 * the transaction i.e. same transaction continued from the previous log
+	 * We can consider the log as switched if this is the first record of
+	 * the log and not the first record of the transaction i.e. same
+	 * transaction continued from the previous log.
 	 */
 	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
 		log->meta.prevlogno != InvalidUndoLogNumber)
@@ -528,26 +528,19 @@ resize:
 	 * If we've rewound all the way back to the start of the transaction by
 	 * rolling back the first subtransaction (which we can't detect until
 	 * after we've allocated some space), we'll need a new transaction header.
-	 * If we weren't already generating one, that will make the record larger,
-	 * so we'll have to go back and recompute the size.
+	 * If we weren't already generating one, then do it now.
 	 */
-	if (!need_start_undo &&
+	if (!need_xact_hdr &&
 		(log->meta.insert == log->meta.last_xact_start ||
 		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
 	{
-		need_start_undo = true;
+		need_xact_hdr = true;
 		urec->uur_info = 0;		/* force recomputation of info bits */
-
 		goto resize;
 	}
 
-	/*
-	 * If transaction id is switched then update the previous transaction's
-	 * start undo record.
-	 */
-	if (first_rec_in_recovery ||
-		(!InRecovery && prev_txid[upersistence] != txid) ||
-		log_switched)
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
 	{
 		/* Don't update our own start header. */
 		if (log->meta.last_xact_start != log->meta.insert)
@@ -567,9 +560,12 @@ resize:
 }
 
 /*
- * Call UndoSetPrepareSize to set the value of how many maximum prepared can
- * be done before inserting the prepared undo.  If size is > MAX_PREPARED_UNDO
- * then it will allocate extra memory to hold the extra prepared undo.
+ * Call UndoSetPrepareSize to set the value of how many undo records can be
+ * prepared before we can insert them.  If the size is greater than
+ * MAX_PREPARED_UNDO then it will allocate extra memory to hold the extra
+ * prepared undo.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
  */
 void
 UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
@@ -602,19 +598,20 @@ UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
 	 */
 	undo_buffer = palloc0((max_prepare + 1) * MAX_BUFFER_PER_UNDO *
 						 sizeof(UndoBuffers));
-	max_prepare_undo = max_prepare;
+	max_prepared_undo = max_prepare;
 }
 
 /*
  * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
  * intended to insert.  Upon return, the necessary undo buffers are pinned and
  * locked.
+ *
  * This should be done before any critical section is established, since it
  * can fail.
  *
- * If not in recovery, 'xid' should refer to the top transaction id because
- * undo log only stores mapping for the top most transactions.
- * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
  */
 UndoRecPtr
 PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
@@ -632,7 +629,7 @@ PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
 	ReadBufferMode	rbm;
 
 	/* Already reached maximum prepared limit. */
-	if (prepare_idx == max_prepare_undo)
+	if (prepare_idx == max_prepared_undo)
 		return InvalidUndoRecPtr;
 
 	/*
@@ -883,16 +880,16 @@ ResetUndoBuffers(void)
 	multi_prep_urp = InvalidUndoRecPtr;
 
 	/*
-	 * max_prepare_undo limit is changed so free the allocated memory and reset
+	 * max_prepared_undo limit is changed so free the allocated memory and reset
 	 * all the variable back to its default value.
 	 */
-	if (max_prepare_undo > MAX_PREPARED_UNDO)
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
 	{
 		pfree(undo_buffer);
 		pfree(prepared_undo);
 		undo_buffer = def_buffers;
 		prepared_undo = def_prepared;
-		max_prepare_undo = MAX_PREPARED_UNDO;
+		max_prepared_undo = MAX_PREPARED_UNDO;
 	}
 }
 
#23Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#22)
1 attachment(s)
Re: Undo logs

On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Nov 20, 2018 at 7:37 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

Along with that I have merged latest changes in zheap branch committed
by Rafia Sabih for cleaning up the undo buffer information in abort
path.

Thanks, a few more comments:

1.
@@ -2627,6 +2653,7 @@ AbortTransaction(void)
AtEOXact_HashTables(false);
AtEOXact_PgStat(false);
AtEOXact_ApplyLauncher(false);
+ AtAbort_ResetUndoBuffers();

Don't we need similar handling in AbortSubTransaction?

Yeah, we do. I have fixed it.

2.
Read undo record header in by calling UnpackUndoRecord, if the undo
+ * record header is splited across buffers then we need to read the complete
+ * header by invoking UnpackUndoRecord multiple times.

/splited/splitted. You can just use split here.

Fixed

3.
+/*
+ * Identifying information for a transaction to which this undo belongs.
+ * it will also store the total size of the undo for this transaction.
+ */
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress;  /* undo applying progress. */
+ uint32 urec_xidepoch;  /* epoch of the current transaction */
+ Oid urec_dbid; /* database id */
+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;

/it will/It will.
BTW, which field(s) in the above structure stores the size of the undo?

Fixed

4.
+ /*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * fields are set and calculate the size of the undo record based on the
+ * uur_info.
+ */

Can we rephrase it as "calculate the size of the undo record based on
the info required"?

Fixed

5.
+/*
+ * Unlock and release undo buffers.  This step performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)

Can we change the later sentence as "This step is performed after
exiting any critical section where we have performed undo action."?

Done. I mentioned "This step is performed after exiting any critical
section where we have prepared undo record."

6.
+InsertUndoRecord()
{
..
+ Assert (uur->uur_info != 0);

Add a comment above Assert "The undo record must contain a valid information."

Done

6.
+UndoRecordAllocateMulti(UnpackedUndoRecord *undorecords, int nrecords,
+ UndoPersistence upersistence, TransactionId txid)
{
..
+ first_rec_in_recovery = InRecovery && IsTransactionFirstRec(txid);
+
+ if ((!InRecovery && prev_txid[upersistence] != txid) ||
+ first_rec_in_recovery)
+ {
+ need_start_undo = true;
+ }

Here, I think we can avoid using two boolean variables
(first_rec_in_recovery and need_start_undo). Also, this same check is
used in this function twice. I have tried to simplify it in the
attached. Can you check and let me know if that sounds okay to you?

I have taken your changes

7.
UndoRecordAllocateMulti
{
..
/*
+ * If this is the first undo record of the transaction then initialize
+ * the transaction header fields of the undorecord. Also, set the flag
+ * in the uur_info to indicate that this record contains the transaction
+ * header so allocate the space for the same.
+ */
+ if (need_start_undo && i == 0)
+ {
+ urec->uur_next = InvalidUndoRecPtr;
+ urec->uur_xidepoch = GetEpochForXid(txid);
+ urec->uur_progress = 0;
+
+ /* During recovery, Fetch database id from the undo log state. */
+ if (InRecovery)
+ urec->uur_dbid = UndoLogStateGetDatabaseId();
+ else
+ urec->uur_dbid = MyDatabaseId;
+
+ /* Set uur_info to include the transaction header. */
+ urec->uur_info |= UREC_INFO_TRANSACTION;
+ }
..
}

It seems here you have written the code in your comments. I have
changed it in the attached delta patch.

Taken your changes.

8.
UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+    TransactionId xid, UndoPersistence upersistence)
+{
..
..
+ multi_prep_urp = UndoRecordAllocateMulti(undorecords, max_prepare,

Can we rename this variable as prepared_urec_ptr or prepared_urp?

Done

9.
+void
+UndoSetPrepareSize(int max_prepare,

I think it will be better to use nrecords instead of 'max_prepare'
similar to how you have it in UndoRecordAllocateMulti()

Done

10.
+ if (!UndoRecPtrIsValid(multi_prep_urp))
+ urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+ else
+ urecptr = multi_prep_urp;
+
+ size = UndoRecordExpectedSize(urec);
..
..
+ if (UndoRecPtrIsValid(multi_prep_urp))
+ {
+ UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+ insert = UndoLogOffsetPlusUsableBytes(insert, size);
+ multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+ }

Can't we use urecptr instead of multi_prep_urp in above code after
urecptr is initialized?

I think we can't, because urecptr is just the current pointer we are
going to return, whereas multi_prep_urp is the static variable we need to
update so that the next prepare can calculate urecptr from this location.
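
Schematically, the static pointer acts as a cursor across successive
prepares (an illustrative sketch with made-up caller variables, using the
renamed prepared_urec_ptr; not the actual code):

	/* Pre-allocate space for two records; this sets the static cursor. */
	UndoSetPrepareSize(2, undorecords, xid, upersistence);
	urp1 = PrepareUndoInsert(&undorecords[0], upersistence, xid);
	/* cursor has advanced past record 0 ... */
	urp2 = PrepareUndoInsert(&undorecords[1], upersistence, xid);
	/* ... so urp2 starts right after record 0 in the pre-allocated space */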

11.
+static int max_prepare_undo = MAX_PREPARED_UNDO;

Let's change the name of this variable as max_prepared_undo. Already
changed in attached delta patch

ok

12.
PrepareUndoInsert()
{
..
+ /* Already reached maximum prepared limit. */
+ if (prepare_idx == max_prepare_undo)
+ return InvalidUndoRecPtr;
..
}

I think in the above condition, we should have elog, otherwise,
callers need to be prepared to handle it.

Done

13.
UndoRecordAllocateMulti()

How about naming it as UndoRecordAllocate as this is used to allocate
even a single undo record?

Done

14.
If not already done, can you pgindent the new code added by this patch?

Done

Attached is a delta patch on top of your previous patch containing
some fixes as mentioned above and a few other minor changes and cleanup.
If you find the changes okay, kindly include them in your next version.

I have taken your changes.
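
For anyone following along, the caller pattern this interface is aiming at
(as documented in undoinsert.h; the variables and the WAL call below are
only placeholders) is roughly:

	/* Before entering a critical section, since this can fail. */
	urecptr = PrepareUndoInsert(&undorecord, upersistence, xid);

	START_CRIT_SECTION();
	InsertPreparedUndo();	/* writes into the already pinned/locked buffers */
	/* ... XLogInsert() of the corresponding WAL record goes here ... */
	END_CRIT_SECTION();

	/* After exiting the critical section. */
	UnlockReleaseUndoBuffers();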

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v9.patch (application/octet-stream)
From 3bfc6a928b285d5ed88607a8cea1b25df31d2325 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Wed, 28 Nov 2018 05:04:55 -0800
Subject: [PATCH] undo-interface-v9

Provide an interface to prepare, insert, or fetch undo
records. This layer uses the undo-log-storage layer to reserve space for
the undo records and the buffer management routines to write and read the
undo records.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila

Reviewed by Amit Kapila
---
 src/backend/access/transam/xact.c    |   28 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1194 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  451 +++++++++++++
 src/include/access/undoinsert.h      |  109 ++++
 src/include/access/undorecord.h      |  222 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2038 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index a979d7e..d716753 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2627,6 +2653,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4811,6 +4838,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index dce4c01..36c161e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8511,6 +8511,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..ccbcc66
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1194 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored in sequential order in the undo log.  Each
+ *  transaction's first undo record (a.k.a. the transaction header) points to
+ *  the next transaction's start header.  Transaction headers are linked so
+ *  that the discard worker can process the undo log transaction by
+ *  transaction without having to read every undo record.
+ *
+ * Handling multi log:
+ *
+ *  It is possible that the undo records of a transaction are spread across
+ *  multiple undo logs, and some special handling is needed while inserting
+ *  the undo for discard and rollback to work sanely.
+ *
+ *  If an undo record goes to the next log then we insert a transaction
+ *  header for the first record in the new log and update the transaction
+ *  header with this new log's location.  This allows us to connect
+ *  transactions across logs when the same transaction spans multiple logs
+ *  (for this we keep track of the previous logno in the undo log meta-data),
+ *  which is required to find the latest undo record pointer of an aborted
+ *  transaction for executing the undo actions before discard.  If the next
+ *  log gets processed first, we don't need to trace back to the actual start
+ *  pointer of the transaction; in that case we can execute the undo actions
+ *  only from the current log, because the undo pointer in the slot will be
+ *  rewound and that is enough to avoid executing the same actions again.
+ *  However, it is possible that after executing the undo actions the undo
+ *  pointer gets discarded; at a later stage, while processing the previous
+ *  log, we might try to fetch an undo record in the discarded log while
+ *  chasing the transaction header chain.  To avoid this, we first check
+ *  whether the next_urec of the transaction is already discarded; if so,
+ *  there is no need to access it and we start executing from the last undo
+ *  record in the current log.
+ *
+ *  We only connect to the next log if the same transaction spreads to it.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new top
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert a transaction start header.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+}			UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+}			PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+static bool update_prev_header = false;
+
+/*
+ * By default, prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to prepare more than the default number of undo records, the
+ * limit can be raised by calling UndoSetPrepareSize.  In that case, dynamic
+ * memory is allocated and prepared_undo and undo_buffer start pointing to the
+ * newly allocated memory, which is released by UnlockReleaseUndoBuffers, at
+ * which point these variables are set back to their default values.
+ */
+static PreparedUndoSpace * prepared_undo = def_prepared;
+static UndoBuffers * undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating its undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+}			XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record has been discarded.  Returns false if it has
+ * already been discarded, otherwise true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock if
+ * it returns false; otherwise the lock is still held on return and the caller
+ * must release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on it to decide
+		 * whether the undo record pointer has already been discarded.
+		 * Instead, check by calling the undo log routine.  If it is not yet
+		 * discarded, reacquire log->discard_lock so that the record cannot
+		 * be discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		xact_urp = InvalidUndoRecPtr;
+	else
+		xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that the
+	 * discard worker doesn't remove the record while we are reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo record, in which case we have
+	 * nothing to update.
+	 * UndoRecordIsValid will release the lock if it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the header
+	 * is split across buffers, we must invoke UnpackUndoRecord multiple times
+	 * to read it completely.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info.idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info.uur.uur_next = urecptr;
+	xact_urec_info.urecptr = xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just writes the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside a critical section,
+ * and it overwrites only the undo header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info.urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info.urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that the
+	 * discard worker can't remove the record while we are reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info.idx_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Look up the block number in the undo buffer array.  If it's present, just
+ * return its index; otherwise read the buffer, insert an entry into the
+ * array and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to compare just the block number, because
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert it into the
+	 * undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords undo records and allocate
+ * their space in bulk.  This is required for operations that allocate
+ * multiple undo records under one WAL record, e.g. multi-insert.  If we don't
+ * allocate undo space for all the records (which are inserted under one WAL
+ * record) together, some of them could end up in different undo logs, and
+ * during recovery we currently have no mechanism to map an xid to multiple
+ * log numbers within one WAL operation.  In short, all the undo written under
+ * one WAL record must be allocated from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   UndoPersistence upersistence, TransactionId txid)
+{
+	UnpackedUndoRecord *urec;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values,
+			 * as they are used only with the first record of the transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * We consider the log to have switched if this is the first record of the
+	 * log but not the first record of the transaction, i.e. the same
+	 * transaction continued from the previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set how many undo records can be prepared before
+ * they are inserted.  If the size is greater than MAX_PREPARED_UNDO, extra
+ * memory is allocated to hold the extra prepared undo records.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, upersistence,
+										   txid);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's starting
+	 * undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
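+
+/*
+ * A minimal caller-side sketch of preparing a batch of undo records (the
+ * array sizes, variable names and the persistence value chosen here are
+ * illustrative only, not part of this API):
+ *
+ *     UnpackedUndoRecord undorecords[nrecords];
+ *     UndoRecPtr urecptrs[nrecords];
+ *     UndoPersistence persistence = ...;
+ *     int i;
+ *
+ *     ... fill in undorecords[i], leaving uur_info set to 0 ...
+ *
+ *     UndoSetPrepareSize(nrecords, undorecords, InvalidTransactionId,
+ *                        persistence);
+ *     for (i = 0; i < nrecords; i++)
+ *         urecptrs[i] = PrepareUndoInsert(&undorecords[i], persistence,
+ *                                         InvalidTransactionId);
+ */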
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL; otherwise,
+ * it refers to the top transaction id, because the undo log only stores
+ * mappings for topmost transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+				  TransactionId xid)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "Already reached the maximum prepared limit.");
+
+	/*
+	 * If this is the first undo record for this top transaction, add the
+	 * transaction information to the undo record.
+	 *
+	 * XXX Alternatively, instead of adding the information to this record we
+	 * could prepare a new record which contains only the transaction
+	 * information.
+	 */
+	if (xid == InvalidTransactionId)
+	{
+		/* We expect that during recovery we always have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores
+		 * mappings for topmost transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+	else
+		urecptr = prepared_urec_ptr;
+
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* FIXME: Should we just report error ? */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/* The undo record cannot fit into this block, so go to the next block. */
+		cur_blk++;
+
+		/*
+		 * If we need more pages they'll all be new, so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+	} while (cur_size < size);
+
+	/*
+	 * Save references to the undo record pointer as well as the undo record.
+	 * InsertPreparedUndo will use these to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+	uint16		prev_undolen;
+
+	Assert(prepare_idx > 0);
+
+	/* This must be called under a critical section. */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		/*
+		 * We can read meta.prevlen without locking, because only we can write
+		 * to it.
+		 */
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+		prev_undolen = log->meta.prevlen;
+
+		/* store the previous undo record length in the header */
+		uur->uur_prevlen = prev_undolen;
+
+		/* if starting a new log then there is no prevlen to store */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/* if starting from a new page then include header in prevlen */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we write the first record on the
+			 * page.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit completely, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+			starting_byte = UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			undo_len += UndoLogBlockHeaderSize;
+
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		prev_undolen = undo_len;
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+
+		if (UndoRecPtrIsValid(xact_urec_info.urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during transaction abort.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ *  Reset the global variables related to undo buffers.  This is required on
+ *  transaction abort or when releasing undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	xact_urec_info.urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepared_undo limit was raised, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  If the complete record is fetched from a
+ * single buffer, this function does not release the pin on that buffer, so
+ * the caller can reuse the same urec to fetch another undo record from the
+ * same block.  The caller is responsible for releasing the buffer inside urec
+ * and setting it to invalid if it wishes to fetch a record from another block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a previous buffer then no need to allocate new. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * FIXME: This can be optimized to fetch only the header first, and to
+		 * fetch the complete record only if the block number and offset
+		 * match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_split = true;
+
+		/*
+		 * The complete record doesn't fit into one buffer, so release the
+		 * buffer pin and set the buffer in the undo record to invalid.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer. Otherwise just
+	 * unlock it.
+	 */
+	if (is_undo_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * Fetch the next undo record for the given blkno, offset and transaction id
+ * (if valid).  We need to match the transaction id along with the block
+ * number and offset because in some cases (like reuse of a slot by a
+ * committed transaction), we need to skip a record if it was modified by a
+ * transaction later than the transaction indicated by the previous undo
+ * record.  For example, consider a case where tuple (ctid - 0,1) is modified
+ * by transaction id 500, which belongs to transaction slot 0.  Then the same
+ * tuple is modified by transaction id 501, which belongs to transaction slot
+ * 1.  Then both transaction slots are marked for reuse.  Then the same tuple
+ * is again modified by transaction id 502, which has used slot 0.  Now, a
+ * transaction that started before transaction 500 and wants to traverse the
+ * chain to find a visible tuple would keep rotating infinitely between the
+ * undo tuples written by 502 and 501.  In such a case, we need to skip the
+ * undo tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of the undo tuple written by transaction
+ * 501.  Start the search from urp.  The caller must call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * *urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record, if a valid pointer is passed.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr * urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/*
+		 * If we have a valid buffer pinned, check whether the next record we
+		 * want is in the same block; otherwise release the buffer and set it
+		 * to invalid.
+		 */
+		if (BufferIsValid(urec->uur_buffer))
+		{
+			/*
+			 * Undo buffer will be changed if the next undo record belongs to
+			 * a different block or undo log.
+			 */
+			if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+				(prevrnode.relNode != rnode.relNode))
+			{
+				ReleaseBuffer(urec->uur_buffer);
+				urec->uur_buffer = InvalidBuffer;
+			}
+		}
+		else
+		{
+			/*
+			 * If there is no valid buffer in urec->uur_buffer, that means we
+			 * copied the payload data and tuple data, so free them.
+			 */
+			if (urec->uur_payload.data)
+				pfree(urec->uur_payload.data);
+			if (urec->uur_tuple.data)
+				pfree(urec->uur_tuple.data);
+		}
+
+		/* Reset the urec before fetching the tuple */
+		urec->uur_tuple.data = NULL;
+		urec->uur_tuple.len = 0;
+		urec->uur_payload.data = NULL;
+		urec->uur_payload.len = 0;
+		prevrnode = rnode;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecPtrIsValid(log->oldest_data))
+		{
+			/*
+			 * UndoDiscardInfo is not yet initialized.  Hence, we have to
+			 * check UndoLogIsDiscarded, and if the record is already
+			 * discarded then we have nothing to do.
+			 */
+			LWLockRelease(&log->discard_lock);
+			if (UndoLogIsDiscarded(urp))
+			{
+				if (BufferIsValid(urec->uur_buffer))
+					ReleaseBuffer(urec->uur_buffer);
+				return NULL;
+			}
+
+			LWLockAcquire(&log->discard_lock, LW_SHARED);
+		}
+
+		/* Check if it's already discarded. */
+		if (urp < log->oldest_data)
+		{
+			LWLockRelease(&log->discard_lock);
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undo record satisfies the conditions. */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl *prevlog,
+				   *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr(logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..73076dc
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,451 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
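+
+/*
+ * To illustrate the calling convention described above, a sketch of the
+ * intended loop (how the caller obtains each successive undo page is left
+ * out; see InsertPreparedUndo in undoinsert.c for a real caller):
+ *
+ *     int already_written = 0;
+ *
+ *     while (!InsertUndoRecord(uur, page, starting_byte, &already_written,
+ *                              false))
+ *     {
+ *         page = ... next page of the undo log ...;
+ *         starting_byte = SizeOfPageHeaderData;
+ *     }
+ */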
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing sizeof(PageHeaderData)
+ * as starting_byte.
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_split = (my_bytes_decoded > 0);
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page, just point
+		 * the payload data and tuple data into the page; otherwise allocate
+		 * memory for them.
+		 *
+		 * XXX As a possible optimization, instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload or
+		 * tuple data falls entirely within one page and avoid allocating
+		 * memory for that part.
+		 */
+		if (!is_undo_split &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
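+
+/*
+ * The calling pattern mirrors InsertUndoRecord; a minimal sketch (how the
+ * caller obtains each successive undo page is left out; see UndoGetOneRecord
+ * in undoinsert.c for a real caller):
+ *
+ *     int already_decoded = 0;
+ *
+ *     while (!UnpackUndoRecord(uur, page, starting_byte, &already_decoded,
+ *                              false))
+ *     {
+ *         page = ... next page of the undo log ...;
+ *         starting_byte = SizeOfPageHeaderData;
+ *     }
+ */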
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for however much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true, we just skip over 'readlen' bytes of undo data
+ * without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..0122850
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,109 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id, because
+ * the undo log only stores mappings for topmost transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
+									TransactionId xid);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section where we have prepared the undo record.
+ */
+extern void UnlockReleaseUndoBuffers(void);
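+
+/*
+ * Putting the pieces above together, a caller that emits undo alongside its
+ * WAL record would do something along these lines (a sketch only; the rmgr
+ * id, info bits and block ids are the caller's own, and the exact WAL record
+ * construction is up to the caller):
+ *
+ *     urecptr = PrepareUndoInsert(&undorecord, persistence,
+ *                                 InvalidTransactionId);
+ *     START_CRIT_SECTION();
+ *     InsertPreparedUndo();
+ *     XLogBeginInsert();
+ *     ... XLogRegisterData / XLogRegisterBuffer for the caller's own data ...
+ *     RegisterUndoLogBuffers(first_block_id);
+ *     recptr = XLogInsert(rmid, info);
+ *     UndoLogBuffersSetLSN(recptr);
+ *     END_CRIT_SECTION();
+ *     UnlockReleaseUndoBuffers();
+ */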
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for the given blkno and offset.  Start the
+ * search from urp.  The caller must call UndoRecordRelease to release the
+ * resources allocated by this function.
+ */
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno,
+				OffsetNumber offset,
+				TransactionId xid,
+				UndoRecPtr * urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
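+
+/*
+ * A sketch of fetching and releasing an undo record (the callback shown is a
+ * hypothetical caller-provided SatisfyUndoRecordCallback; passing
+ * InvalidBlockNumber as blkno returns the record at urp without applying the
+ * callback):
+ *
+ *     UnpackedUndoRecord *uur;
+ *     UndoRecPtr urec_ptr;
+ *
+ *     uur = UndoFetchRecord(urp, blkno, offset, xid, &urec_ptr,
+ *                           my_satisfies_callback);
+ *     if (uur != NULL)
+ *     {
+ *         ... examine uur->uur_block, uur->uur_tuple, etc. ...
+ *         UndoRecordRelease(uur);
+ *     }
+ *
+ * UndoFetchRecord returns NULL if the record has already been discarded.
+ */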
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before they are inserted.  If the size is > MAX_PREPARED_UNDO,
+ * extra memory is allocated to hold the extra prepared undo records.
+ */
+extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+				   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * Return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+/* Reset globals related to undo buffers */
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..af967e8
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,222 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without any padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32		urec_progress;	/* undo applying progress. */
+	uint32		urec_xidepoch;	/* epoch of the current transaction */
+	Oid			urec_dbid;		/* database id */
+	uint64		urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In the
+	 * future it could also be used (with some formula) to show how much undo
+	 * has been applied so far, but currently only 0 and 1 are used.
+	 */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index e01d12e..8cfcd44 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -277,6 +277,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1
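
The undorecord.h comments above describe a multi-call protocol for records
that can spill over a page boundary: keep calling InsertUndoRecord() with
successive pages until it returns true, and read data back the same way
with UnpackUndoRecord() and its already_decoded counter.  As a rough
illustration only (not part of the patch set; get_undo_page_for_block() is
a made-up placeholder for however the caller pins and locks the right undo
buffer), the write loop could look like this:

/* Made-up placeholder: obtain the undo page holding the given block. */
extern Page get_undo_page_for_block(BlockNumber blkno);

static void
write_one_undo_record(UnpackedUndoRecord *uur, BlockNumber blkno,
					  int starting_byte)
{
	int			already_written = 0;

	for (;;)
	{
		Page		page = get_undo_page_for_block(blkno);

		/* Returns true once the entire record has been written. */
		if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
			break;

		/* The record continues on the next page, just after its page header. */
		blkno++;
		starting_byte = sizeof(PageHeaderData);
	}
}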

#24Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#23)
1 attachment(s)
Re: Undo logs

On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

10.
+ if (!UndoRecPtrIsValid(multi_prep_urp))
+ urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+ else
+ urecptr = multi_prep_urp;
+
+ size = UndoRecordExpectedSize(urec);
..
..
+ if (UndoRecPtrIsValid(multi_prep_urp))
+ {
+ UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+ insert = UndoLogOffsetPlusUsableBytes(insert, size);
+ multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+ }

Can't we use urecptr instead of multi_prep_urp in above code after
urecptr is initialized?

I think we can't because urecptr is just the current pointer we are
going to return but multi_prep_urp is the static variable we need to
update so that
the next prepare can calculate urecptr from this location.

Okay, but that was not apparent from the code, so I have added a
comment in the attached delta patch.  BTW, wouldn't it be better to
move this code to the end of the function, once the prepare for the
current record is complete?

More comments
----------------------------
1.
* We can consider that the log as switched if

/that/ needs to be removed.

2.
+ if (prepare_idx == max_prepared_undo)
+ elog(ERROR, "Already reached the maximum prepared limit.");

a. /Already/already
b. we don't use full-stop (.) in error

3.
+ * also stores the dbid and the progress of the undo apply during rollback.

/the progress/ extra space.

4.
+UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
+    TransactionId xid, UndoPersistence upersistence)
+{

nrecords should be the second parameter.

5.
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+   TransactionId xid)

It seems better to have xid parameter before UndoPersistence.

6.
+ /* FIXME: Should we just report error ? */
+ Assert(index < MAX_BUFFER_PER_UNDO);

No need of this Fixme.

7.
PrepareUndoInsert()
{
..
do
{
..
+ /* Undo record can not fit into this block so go to the next block. */
+ cur_blk++;
..
} while (cur_size < size);
..
}

This comment was making me uneasy, so I have slightly adjusted the
code.  Basically, at that point it has not yet been decided whether
the undo record can fit in the current buffer.

8.
+ /*
+ * Save referenced of undo record pointer as well as undo record.
+ * InsertPreparedUndo will use these to insert the prepared record.
+ */
+ prepared_undo[prepare_idx].urec = urec;
+ prepared_undo[prepare_idx].urp = urecptr;

Slightly adjust the above comment.

9.
+InsertPreparedUndo(void)
{
..
+ Assert(prepare_idx > 0);
+
+ /* This must be called under a critical section. */
+ Assert(InRecovery || CritSectionCount > 0);
..
}

I have added a few more comments for the above assertions; see if those are correct.

10.
+InsertPreparedUndo(void)
{
..
+ prev_undolen = undo_len;
+
+ UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
..
}

There is no need to use an additional variable prev_undolen in the
above code.  I have modified the code to remove its usage; check if
that is correct.

11.
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp.  Caller need to call UndoRecordRelease
to release the
+ * resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+ TransactionId xid, UndoRecPtr * urec_ptr_out,
+ SatisfyUndoRecordCallback callback)

The comment above UndoFetchRecord is very zheap-specific, so I have
tried to simplify it.  I think we can give such detailed examples
only when we introduce the zheap code.

Apart from above, there are miscellaneous comments and minor code
edits in the attached delta patch.

12.
PrepareUndoInsert()
{
..
+ /*
+ * If this is the first undo record for this top transaction add the
+ * transaction information to the undo record.
+ *
+ * XXX there is also an option that instead of adding the information to
+ * this record we can prepare a new record which only contain transaction
+ * informations.
+ */
+ if (xid == InvalidTransactionId)

The above comment seems to be out of place; we are doing nothing like
that here.  This work is done in UndoRecordAllocate, so maybe you can
move the 'XXX ..' part of the comment into that function.

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+ urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+ else
+ urecptr = prepared_urec_ptr;
+
+ size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof.  As it
is written, there is no guarantee that the size we have allocated is
the same as what we use in this function.  How about if we take
'size' as an output parameter from UndoRecordAllocate and then use it
in this function?  Additionally, we can have an Assert that the size
returned by UndoRecordAllocate is the same as UndoRecordExpectedSize.
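
For illustration, the suggested call shape might look roughly like this
(hypothetical signature, only to show the idea; the real function may
take the arguments in a different order):

	Size		size;

	/* Hypothetical: the allocate call also reports the size it reserved. */
	urecptr = UndoRecordAllocate(urec, 1, upersistence, txid, &size);
	Assert(size == UndoRecordExpectedSize(urec));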

14.
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+ int idx;
+ int flags;
+
+ for (idx = 0; idx < buffer_idx; idx++)
+ {
+ flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+ XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+ }
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+ int idx;
+
+ for (idx = 0; idx < buffer_idx; idx++)
+ PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}

A one-line comment atop each of these functions would be good.  It
would be better if we place these functions at the end of the file or
someplace else, as right now they sit between the prepare* and insert*
function calls, which makes the code flow a bit awkward.

15.
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{

Why should this step be performed in a critical section only for
persistent undo?  I think as this function operates on shared buffers,
it should be done in a critical section even for unlogged undo.

16.
+InsertPreparedUndo(void)
{
..
/* if starting a new log then there is no prevlen to store */
+ if (offset == UndoLogBlockHeaderSize)
+ {
+ if (log->meta.prevlogno != InvalidUndoLogNumber)
+ {
+ UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+ uur->uur_prevlen = prevlog->meta.prevlen;
+ }
..
}

The comment here suggests that for new logs we don't need prevlen,
but still, in one case you are maintaining the length; can you add a
few comments to explain why?

17.
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+ UndoPersistence persistence)
{
..
+ /*
+ * FIXME: This can be optimized to just fetch header first and only if
+ * matches with block number and offset then fetch the complete
+ * record.
+ */
+ if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+ break;
..
}

I don't know how much it matters whether we fetch the complete record
or just its header, unless the record is big or it falls across two
pages.  I think both are boundary cases and I couldn't see this part
much in perf profiles.  There is nothing to fix here; if you want, you
can add an XXX comment or maybe suggest it as a future optimization.
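
For example, something along these lines (wording invented here, not taken
from the patch):

	/*
	 * XXX We fetch the complete record even when the caller only needs the
	 * header to match the block number and offset.  This could be optimized
	 * in the future, but it should only matter for records that are large
	 * or that span two pages.
	 */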

18.
+UndoFetchRecord()
{
...
+ /*
+ * Prevent UndoDiscardOneLog() from discarding data while we try to
+ * read it.  Usually we would acquire log->mutex to read log->meta
+ * members, but in this case we know that discard can't move without
+ * also holding log->discard_lock.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+ if (!UndoRecPtrIsValid(log->oldest_data))
+ {
+ /*
+ * UndoDiscardInfo is not yet initialized. Hence, we've to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(urp))
+ {
+ if (BufferIsValid(urec->uur_buffer))
+ ReleaseBuffer(urec->uur_buffer);
+ return NULL;
+ }
+
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+ }
+
+ /* Check if it's already discarded. */
+ if (urp < log->oldest_data)
+ {
+ LWLockRelease(&log->discard_lock);
+ if (BufferIsValid(urec->uur_buffer))
+ ReleaseBuffer(urec->uur_buffer);
+ return NULL;
+ }
..
}

Can't we replace this logic with UndoRecordIsValid?

19.
UndoFetchRecord()
{
..
while(true)
{
..
/*
+ * If we have a valid buffer pinned then just ensure that we want to
+ * find the next tuple from the same block.  Otherwise release the
+ * buffer and set it invalid
+ */
+ if (BufferIsValid(urec->uur_buffer))
+ {
+ /*
+ * Undo buffer will be changed if the next undo record belongs to
+ * a different block or undo log.
+ */
+ if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+ (prevrnode.relNode != rnode.relNode))
+ {
+ ReleaseBuffer(urec->uur_buffer);
+ urec->uur_buffer = InvalidBuffer;
+ }
+ }
+ else
+ {
+ /*
+ * If there is not a valid buffer in urec->uur_buffer that means
+ * we had copied the payload data and tuple data so free them.
+ */
+ if (urec->uur_payload.data)
+ pfree(urec->uur_payload.data);
+ if (urec->uur_tuple.data)
+ pfree(urec->uur_tuple.data);
+ }
+
+ /* Reset the urec before fetching the tuple */
+ urec->uur_tuple.data = NULL;
+ urec->uur_tuple.len = 0;
+ urec->uur_payload.data = NULL;
+ urec->uur_payload.len = 0;
+ prevrnode = rnode;
..
}

Can't we move this logic after getting a record with UndoGetOneRecord
and matching with a callback? This is certainly required after the
first record, so it looks a bit odd here. Also, if possible can we
move it to a separate function as this is not the main logic and makes
the main logic difficult to follow.
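
A rough sketch of the kind of helper that could be split out (the name and
exact arguments are invented here, purely to illustrate the suggestion; it
just repackages the logic quoted above):

static void
reset_urec_for_next_fetch(UnpackedUndoRecord *urec, UndoRecPtr urp,
						  RelFileNode rnode, RelFileNode prevrnode)
{
	if (BufferIsValid(urec->uur_buffer))
	{
		/* Keep the pin only if the next record is in the same block and log. */
		if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
			prevrnode.relNode != rnode.relNode)
		{
			ReleaseBuffer(urec->uur_buffer);
			urec->uur_buffer = InvalidBuffer;
		}
	}
	else
	{
		/* No pinned buffer means payload and tuple data were copied; free them. */
		if (urec->uur_payload.data)
			pfree(urec->uur_payload.data);
		if (urec->uur_tuple.data)
			pfree(urec->uur_tuple.data);
	}

	/* Reset the record before fetching the next tuple. */
	urec->uur_payload.data = NULL;
	urec->uur_payload.len = 0;
	urec->uur_tuple.data = NULL;
	urec->uur_tuple.len = 0;
}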

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v9-delta-amit.patchapplication/octet-stream; name=0003-undo-interface-v9-delta-amit.patchDownload
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
index ccbcc66059..fbc56363c8 100644
--- a/src/backend/access/undo/undoinsert.c
+++ b/src/backend/access/undo/undoinsert.c
@@ -511,7 +511,7 @@ resize:
 	Assert(AmAttachedToUndoLog(log) || InRecovery);
 
 	/*
-	 * We can consider that the log as switched if this is the first record of
+	 * We can consider the log as switched if this is the first record of
 	 * the log and not the first record of the transaction i.e. same
 	 * transaction continued from the previous log.
 	 */
@@ -563,7 +563,7 @@ resize:
  * This is normally used when more than one undo record needs to be prepared.
  */
 void
-UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
 				   TransactionId xid, UndoPersistence upersistence)
 {
 	TransactionId txid;
@@ -609,8 +609,8 @@ UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
  * for the top most transactions.
  */
 UndoRecPtr
-PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
-				  TransactionId xid)
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
 {
 	UndoRecordSize size;
 	UndoRecPtr	urecptr;
@@ -625,7 +625,7 @@ PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
 
 	/* Already reached maximum prepared limit. */
 	if (prepare_idx == max_prepared_undo)
-		elog(ERROR, "Already reached the maximum prepared limit.");
+		elog(ERROR, "already reached the maximum prepared limit");
 
 	/*
 	 * If this is the first undo record for this top transaction add the
@@ -637,7 +637,7 @@ PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
 	 */
 	if (xid == InvalidTransactionId)
 	{
-		/* we expect during recovery, we always have a valid transaction id. */
+		/* During recovery, we must have a valid transaction id. */
 		Assert(!InRecovery);
 		txid = GetTopTransactionId();
 	}
@@ -656,6 +656,7 @@ PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
 	else
 		urecptr = prepared_urec_ptr;
 
+	/* advance the prepared ptr location for next record. */
 	size = UndoRecordExpectedSize(urec);
 	if (UndoRecPtrIsValid(prepared_urec_ptr))
 	{
@@ -686,25 +687,23 @@ PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
 		else
 			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
 
-		/* FIXME: Should we just report error ? */
+		/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
 		Assert(index < MAX_BUFFER_PER_UNDO);
 
-		/* Keep the track of the buffers we have pinned. */
+		/* Keep the track of the buffers we have pinned and locked. */
 		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
 
-		/* Undo record can not fit into this block so go to the next block. */
-		cur_blk++;
-
 		/*
 		 * If we need more pages they'll be all new so we can definitely skip
 		 * reading from disk.
 		 */
 		rbm = RBM_ZERO;
+		cur_blk++;
 	} while (cur_size < size);
 
 	/*
-	 * Save referenced of undo record pointer as well as undo record.
-	 * InsertPreparedUndo will use these to insert the prepared record.
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
 	 */
 	prepared_undo[prepare_idx].urec = urec;
 	prepared_undo[prepare_idx].urp = urecptr;
@@ -754,11 +753,14 @@ InsertPreparedUndo(void)
 	UnpackedUndoRecord *uur;
 	UndoLogOffset offset;
 	UndoLogControl *log;
-	uint16		prev_undolen;
 
+	/* There must be at least one prepared undo record. */
 	Assert(prepare_idx > 0);
 
-	/* This must be called under a critical section. */
+	/*
+	 * This must be called under a critical section or we must be in
+	 * recovery.
+	 */
 	Assert(InRecovery || CritSectionCount > 0);
 
 	for (idx = 0; idx < prepare_idx; idx++)
@@ -771,16 +773,14 @@ InsertPreparedUndo(void)
 		starting_byte = UndoRecPtrGetPageOffset(urp);
 		offset = UndoRecPtrGetOffset(urp);
 
-		/*
-		 * We can read meta.prevlen without locking, because only we can write
-		 * to it.
-		 */
 		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
 		Assert(AmAttachedToUndoLog(log) || InRecovery);
-		prev_undolen = log->meta.prevlen;
 
-		/* store the previous undo record length in the header */
-		uur->uur_prevlen = prev_undolen;
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
 
 		/* if starting a new log then there is no prevlen to store */
 		if (offset == UndoLogBlockHeaderSize)
@@ -811,7 +811,7 @@ InsertPreparedUndo(void)
 
 			/*
 			 * Initialize the page whenever we try to write the first record
-			 * in page.
+			 * in the page.  We start writing immediately after the block header.
 			 */
 			if (starting_byte == UndoLogBlockHeaderSize)
 				PageInit(page, BLCKSZ, 0);
@@ -828,22 +828,25 @@ InsertPreparedUndo(void)
 			}
 
 			MarkBufferDirty(buffer);
-			starting_byte = UndoLogBlockHeaderSize;
-			bufidx++;
 
 			/*
 			 * If we are swithing to the next block then consider the header
 			 * in total undo length.
 			 */
+			starting_byte = UndoLogBlockHeaderSize;
 			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
 
+			/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
 			Assert(bufidx < MAX_BUFFER_PER_UNDO);
 		} while (true);
 
-		prev_undolen = undo_len;
-
-		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
 
+		/*
+		 * Link the transactions in the same log so that we can discard all
+		 * the transaction's undo log in one-shot.
+		 */
 		if (UndoRecPtrIsValid(xact_urec_info.urecptr))
 			UndoRecordUpdateTransInfo();
 
@@ -856,8 +859,8 @@ InsertPreparedUndo(void)
 }
 
 /*
- *  Reset the global variables related to undo buffers. This is required at the
- *  transaction abort or releasing undo buffers
+ * Reset the global variables related to undo buffers. This is required at the
+ * transaction abort and while releasing the undo buffers.
  */
 void
 ResetUndoBuffers(void)
@@ -879,7 +882,7 @@ ResetUndoBuffers(void)
 
 	/*
 	 * max_prepared_undo limit is changed so free the allocated memory and
-	 * reset all the variable back to its default value.
+	 * reset all the variables back to their default values.
 	 */
 	if (max_prepared_undo > MAX_PREPARED_UNDO)
 	{
@@ -892,8 +895,8 @@ ResetUndoBuffers(void)
 }
 
 /*
- * Unlock and release undo buffers.  This step performed after exiting any
- * critical section.
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
  */
 void
 UnlockReleaseUndoBuffers(void)
@@ -909,10 +912,10 @@ UnlockReleaseUndoBuffers(void)
 /*
  * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
  * by urp and unpack the record into urec.  This function will not release the
- * pin on the buffer if complete record is fetched from one buffer,  now caller
+ * pin on the buffer if complete record is fetched from one buffer, so caller
  * can reuse the same urec to fetch the another undo record which is on the
  * same block.  Caller will be responsible to release the buffer inside urec
- * and set it to invalid if he wishes to fetch the record from another block.
+ * and set it to invalid if it wishes to fetch the record from another block.
  */
 static UnpackedUndoRecord *
 UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
@@ -923,11 +926,11 @@ UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
 	int			starting_byte = UndoRecPtrGetPageOffset(urp);
 	int			already_decoded = 0;
 	BlockNumber cur_blk;
-	bool		is_undo_splited = false;
+	bool		is_undo_rec_split = false;
 
 	cur_blk = UndoRecPtrGetBlockNum(urp);
 
-	/* If we already have a previous buffer then no need to allocate new. */
+	/* If we already have a buffer pin then no need to allocate a new one. */
 	if (!BufferIsValid(buffer))
 	{
 		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
@@ -951,11 +954,11 @@ UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
 			break;
 
 		starting_byte = UndoLogBlockHeaderSize;
-		is_undo_splited = true;
+		is_undo_rec_split = true;
 
 		/*
-		 * Complete record is not fitting into one buffer so release the
-		 * buffer pin and also set invalid buffer in the undo record.
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
 		 */
 		urec->uur_buffer = InvalidBuffer;
 		UnlockReleaseBuffer(buffer);
@@ -968,10 +971,10 @@ UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
 	}
 
 	/*
-	 * If we have copied the data then release the buffer. Otherwise just
+	 * If we have copied the data then release the buffer, otherwise, just
 	 * unlock it.
 	 */
-	if (is_undo_splited)
+	if (is_undo_rec_split)
 		UnlockReleaseBuffer(buffer);
 	else
 		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
@@ -981,29 +984,23 @@ UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
 
 /*
  * Fetch the next undo record for given blkno, offset and transaction id (if
- * valid).  We need to match transaction id along with block number and offset
- * because in some cases (like reuse of slot for committed transaction), we
- * need to skip the record if it is modified by a transaction later than the
- * transaction indicated by previous undo record.  For example, consider a
- * case where tuple (ctid - 0,1) is modified by transaction id 500 which
- * belongs to transaction slot 0. Then, the same tuple is modified by
- * transaction id 501 which belongs to transaction slot 1.  Then, both the
- * transaction slots are marked for reuse. Then, again the same tuple is
- * modified by transaction id 502 which has used slot 0.  Now, some
- * transaction which has started before transaction 500 wants to traverse the
- * chain to find visible tuple will keep on rotating infinitely between undo
- * tuple written by 502 and 501.  In such a case, we need to skip the undo
- * tuple written by transaction 502 when we want to find the undo record
- * indicated by the previous pointer of undo tuple written by transaction 501.
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
  * Start the search from urp.  Caller need to call UndoRecordRelease to release the
  * resources allocated by this function.
  *
  * urec_ptr_out is undo record pointer of the qualified undo record if valid
  * pointer is passed.
+ *
+ * callback function decides whether particular undo record satisfies the
+ * condition of caller.
  */
 UnpackedUndoRecord *
 UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
-				TransactionId xid, UndoRecPtr * urec_ptr_out,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
 				SatisfyUndoRecordCallback callback)
 {
 	RelFileNode rnode,
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
index 012285031c..4b0f1dd82f 100644
--- a/src/include/access/undoinsert.h
+++ b/src/include/access/undoinsert.h
@@ -39,8 +39,8 @@ typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
  * undo log only stores mapping for the top most transactions.
  * If in recovery, 'xid' refers to the transaction id stored in WAL.
  */
-extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, UndoPersistence,
-									TransactionId xid);
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+									UndoPersistence);
 
 /*
  * Insert a previously-prepared undo record.  This will write the actual undo
@@ -93,7 +93,7 @@ extern void UndoRecordSetPrevUndoLen(uint16 len);
  * be done before inserting the prepared undo.  If size is > MAX_PREPARED_UNDO
  * then it will allocate extra memory to hold the extra prepared undo.
  */
-extern void UndoSetPrepareSize(int max_prepare, UnpackedUndoRecord *undorecords,
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
 				   TransactionId xid, UndoPersistence upersistence);
 
 /*
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
index af967e84b4..9ca245509c 100644
--- a/src/include/access/undorecord.h
+++ b/src/include/access/undorecord.h
@@ -101,7 +101,7 @@ typedef struct UndoRecordBlock
 
 /*
  * Identifying information for a transaction to which this undo belongs.  This
- * also stores the dbid and the  progress of the undo apply during rollback.
+ * also stores the dbid and the progress of the undo apply during rollback.
  */
 typedef struct UndoRecordTransaction
 {
#25Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#24)
3 attachment(s)
Re: Undo logs

On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Thu, Nov 29, 2018 at 6:27 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Nov 26, 2018 at 2:13 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

10.
+ if (!UndoRecPtrIsValid(multi_prep_urp))
+ urecptr = UndoRecordAllocateMulti(urec, 1, upersistence, txid);
+ else
+ urecptr = multi_prep_urp;
+
+ size = UndoRecordExpectedSize(urec);
..
..
+ if (UndoRecPtrIsValid(multi_prep_urp))
+ {
+ UndoLogOffset insert = UndoRecPtrGetOffset(multi_prep_urp);
+ insert = UndoLogOffsetPlusUsableBytes(insert, size);
+ multi_prep_urp = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+ }

Can't we use urecptr instead of multi_prep_urp in above code after
urecptr is initialized?

I think we can't because urecptr is just the current pointer we are
going to return but multi_prep_urp is the static variable we need to
update so that
the next prepare can calculate urecptr from this location.

Okay, but that was not apparent from the code, so I have added a
comment in the attached delta patch.  BTW, wouldn't it be better to
move this code to the end of the function, once the prepare for the
current record is complete?

More comments
----------------------------
1.
* We can consider that the log as switched if

/that/ needs to be removed.

2.
+ if (prepare_idx == max_prepared_undo)
+ elog(ERROR, "Already reached the maximum prepared limit.");

a. /Already/already
b. we don't use full-stop (.) in error

Merged your change

3.
+ * also stores the dbid and the progress of the undo apply during rollback.

/the progress/ extra space.

Merged your change

4.
+UndoSetPrepareSize(int nrecords, UnpackedUndoRecord *undorecords,
+    TransactionId xid, UndoPersistence upersistence)
+{

nrecords should be the second parameter.

Merged your change

5.
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, UndoPersistence upersistence,
+   TransactionId xid)

It seems better to have xid parameter before UndoPersistence.

Merged your change.  Made the same changes in UndoRecordAllocate as well.

6.
+ /* FIXME: Should we just report error ? */
+ Assert(index < MAX_BUFFER_PER_UNDO);

No need of this Fixme.

Merged your change

7.
PrepareUndoInsert()
{
..
do
{
..
+ /* Undo record can not fit into this block so go to the next block. */
+ cur_blk++;
..
} while (cur_size < size);
..
}

This comment was making me uneasy, so I have slightly adjusted the
code.  Basically, at that point it has not yet been decided whether
the undo record can fit in the current buffer.

Merged your change

8.
+ /*
+ * Save referenced of undo record pointer as well as undo record.
+ * InsertPreparedUndo will use these to insert the prepared record.
+ */
+ prepared_undo[prepare_idx].urec = urec;
+ prepared_undo[prepare_idx].urp = urecptr;

Slightly adjust the above comment.

Merged your change

9.
+InsertPreparedUndo(void)
{
..
+ Assert(prepare_idx > 0);
+
+ /* This must be called under a critical section. */
+ Assert(InRecovery || CritSectionCount > 0);
..
}

I have added a few more comments for the above assertions; see if those are correct.

Merged your change

10.
+InsertPreparedUndo(void)
{
..
+ prev_undolen = undo_len;
+
+ UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), prev_undolen);
..
}

There is no need to use an additional variable prev_undolen in the
above code.  I have modified the code to remove its usage; check if
that is correct.

looks fine to me.

11.
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  We need to match transaction id along with block number and offset
+ * because in some cases (like reuse of slot for committed transaction), we
+ * need to skip the record if it is modified by a transaction later than the
+ * transaction indicated by previous undo record.  For example, consider a
+ * case where tuple (ctid - 0,1) is modified by transaction id 500 which
+ * belongs to transaction slot 0. Then, the same tuple is modified by
+ * transaction id 501 which belongs to transaction slot 1.  Then, both the
+ * transaction slots are marked for reuse. Then, again the same tuple is
+ * modified by transaction id 502 which has used slot 0.  Now, some
+ * transaction which has started before transaction 500 wants to traverse the
+ * chain to find visible tuple will keep on rotating infinitely between undo
+ * tuple written by 502 and 501.  In such a case, we need to skip the undo
+ * tuple written by transaction 502 when we want to find the undo record
+ * indicated by the previous pointer of undo tuple written by transaction 501.
+ * Start the search from urp.  Caller need to call UndoRecordRelease
to release the
+ * resources allocated by this function.
+ *
+ * urec_ptr_out is undo record pointer of the qualified undo record if valid
+ * pointer is passed.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+ TransactionId xid, UndoRecPtr * urec_ptr_out,
+ SatisfyUndoRecordCallback callback)

The comment above UndoFetchRecord is very zheap-specific, so I have
tried to simplify it.  I think we can give such detailed examples
only when we introduce the zheap code.

Makes sense.

Apart from above, there are miscellaneous comments and minor code
edits in the attached delta patch.

I have merged your changes.

12.
PrepareUndoInsert()
{
..
+ /*
+ * If this is the first undo record for this top transaction add the
+ * transaction information to the undo record.
+ *
+ * XXX there is also an option that instead of adding the information to
+ * this record we can prepare a new record which only contain transaction
+ * informations.
+ */
+ if (xid == InvalidTransactionId)

The above comment seems to be out of place; we are doing nothing like
that here.  This work is done in UndoRecordAllocate, so maybe you can
move the 'XXX ..' part of the comment into that function.

Done

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+ urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+ else
+ urecptr = prepared_urec_ptr;
+
+ size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof.  As it
is written, there is no guarantee that the size we have allocated is
the same as what we use in this function.

I agree
How about if we take 'size' as an output

parameter from UndoRecordAllocate and then use it in this function?
Additionally, we can have an Assert that the size returned by
UndoRecordAllocate is the same as UndoRecordExpectedSize.

With this change we will be able to guarantee the size when we are
allocating a single undo record, but multi-prepare will still be a
problem.  I haven't fixed this yet.  I will think about how to handle
both cases: when we prepare one record at a time and when we allocate
once and prepare multiple times.

14.
+
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+ int idx;
+ int flags;
+
+ for (idx = 0; idx < buffer_idx; idx++)
+ {
+ flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+ XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+ }
+}
+
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+ int idx;
+
+ for (idx = 0; idx < buffer_idx; idx++)
+ PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}

A one-line comment atop each of these functions would be good.  It
would be better if we place these functions at the end of the file or
someplace else, as right now they sit between the prepare* and insert*
function calls, which makes the code flow a bit awkward.

Done

15.
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{

Why should this step be performed in a critical section only for
persistent undo?  I think as this function operates on shared buffers,
it should be done in a critical section even for unlogged undo.

I think we can just remove this comment; I have removed it in the current patch.

16.
+InsertPreparedUndo(void)
{
..
/* if starting a new log then there is no prevlen to store */
+ if (offset == UndoLogBlockHeaderSize)
+ {
+ if (log->meta.prevlogno != InvalidUndoLogNumber)
+ {
+ UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+ uur->uur_prevlen = prevlog->meta.prevlen;
+ }
..
}

The comment here suggests that for new logs we don't need prevlen,
but still, in one case you are maintaining the length; can you add a
few comments to explain why?

Done

17.
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+ UndoPersistence persistence)
{
..
+ /*
+ * FIXME: This can be optimized to just fetch header first and only if
+ * matches with block number and offset then fetch the complete
+ * record.
+ */
+ if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+ break;
..
}

I don't know how much it matters whether we fetch the complete record
or just its header, unless the record is big or it falls across two
pages.  I think both are boundary cases and I couldn't see this part
much in perf profiles.  There is nothing to fix here; if you want, you
can add an XXX comment or maybe suggest it as a future optimization.

Changed

18.
+UndoFetchRecord()
{
...
+ /*
+ * Prevent UndoDiscardOneLog() from discarding data while we try to
+ * read it.  Usually we would acquire log->mutex to read log->meta
+ * members, but in this case we know that discard can't move without
+ * also holding log->discard_lock.
+ */
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+ if (!UndoRecPtrIsValid(log->oldest_data))
+ {
+ /*
+ * UndoDiscardInfo is not yet initialized. Hence, we've to check
+ * UndoLogIsDiscarded and if it's already discarded then we have
+ * nothing to do.
+ */
+ LWLockRelease(&log->discard_lock);
+ if (UndoLogIsDiscarded(urp))
+ {
+ if (BufferIsValid(urec->uur_buffer))
+ ReleaseBuffer(urec->uur_buffer);
+ return NULL;
+ }
+
+ LWLockAcquire(&log->discard_lock, LW_SHARED);
+ }
+
+ /* Check if it's already discarded. */
+ if (urp < log->oldest_data)
+ {
+ LWLockRelease(&log->discard_lock);
+ if (BufferIsValid(urec->uur_buffer))
+ ReleaseBuffer(urec->uur_buffer);
+ return NULL;
+ }
..
}

Can't we replace this logic with UndoRecordIsValid?

Done

19.
UndoFetchRecord()
{
..
while(true)
{
..
/*
+ * If we have a valid buffer pinned then just ensure that we want to
+ * find the next tuple from the same block.  Otherwise release the
+ * buffer and set it invalid
+ */
+ if (BufferIsValid(urec->uur_buffer))
+ {
+ /*
+ * Undo buffer will be changed if the next undo record belongs to
+ * a different block or undo log.
+ */
+ if (UndoRecPtrGetBlockNum(urp) != BufferGetBlockNumber(urec->uur_buffer) ||
+ (prevrnode.relNode != rnode.relNode))
+ {
+ ReleaseBuffer(urec->uur_buffer);
+ urec->uur_buffer = InvalidBuffer;
+ }
+ }
+ else
+ {
+ /*
+ * If there is not a valid buffer in urec->uur_buffer that means
+ * we had copied the payload data and tuple data so free them.
+ */
+ if (urec->uur_payload.data)
+ pfree(urec->uur_payload.data);
+ if (urec->uur_tuple.data)
+ pfree(urec->uur_tuple.data);
+ }
+
+ /* Reset the urec before fetching the tuple */
+ urec->uur_tuple.data = NULL;
+ urec->uur_tuple.len = 0;
+ urec->uur_payload.data = NULL;
+ urec->uur_payload.len = 0;
+ prevrnode = rnode;
..
}

Can't we move this logic after getting a record with UndoGetOneRecord
and matching with a callback? This is certainly required after the
first record, so it looks a bit odd here. Also, if possible can we
move it to a separate function as this is not the main logic and makes
the main logic difficult to follow.

Fixed

Apart from fixing these comments I have also rebased Thomas' undo log
patches on the current head.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0001-Add-undo-log-manager_v3.patchapplication/x-patch; name=0001-Add-undo-log-manager_v3.patchDownload
From 0302c701aa0b5bccd5584ef76b4e93e599a3073b Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 3 Dec 2018 09:46:54 +0530
Subject: [PATCH 1/2] Add undo log manager.

Add a new subsystem to manage undo logs.  Undo logs allow data to be appended
efficiently, like logs.  They also allow data to be discarded efficiently from
the other end, like a queue.  Thirdly, they allow efficient buffered random
access, like a relation.

Undo logs physically consist of a set of 1MB segment files under
$PGDATA/base/undo (or per-tablespace equivalent) that are created, deleted or
renamed as required, similarly to the way that WAL segments are managed.
Meta-data about the set of undo logs is stored in shared memory, and written
to per-checkpoint files under $PGDATA/pg_undo.

This commit provides an API for allocating and discarding undo log storage
space and managing the files in a crash-safe way.  A later commit will provide
support for accessing the data stored inside them.

XXX Status: WIP.  Some details around WAL are being reconsidered, as noted in
comments.

Author: Thomas Munro, with contributions from Dilip Kumar and input from
        Amit Kapila and Robert Haas
Tested-By: Neha Sharma
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/Makefile               |    2 +-
 src/backend/access/rmgrdesc/Makefile      |    2 +-
 src/backend/access/rmgrdesc/undologdesc.c |   88 +
 src/backend/access/transam/rmgr.c         |    1 +
 src/backend/access/transam/xlog.c         |   17 +
 src/backend/access/undo/Makefile          |   17 +
 src/backend/access/undo/undolog.c         | 2643 +++++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql      |    4 +
 src/backend/commands/tablespace.c         |   23 +
 src/backend/replication/logical/decode.c  |    1 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/lmgr/lwlock.c         |    2 +
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/utils/init/postinit.c         |    1 +
 src/backend/utils/misc/guc.c              |   12 +
 src/bin/initdb/initdb.c                   |    2 +
 src/bin/pg_waldump/rmgrdesc.c             |    1 +
 src/include/access/rmgrlist.h             |    1 +
 src/include/access/undolog.h              |  405 +++++
 src/include/access/undolog_xlog.h         |   72 +
 src/include/catalog/pg_proc.dat           |    7 +
 src/include/storage/lwlock.h              |    2 +
 src/include/utils/guc.h                   |    2 +
 src/test/regress/expected/rules.out       |   11 +
 24 files changed, 3318 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
 create mode 100644 src/backend/access/undo/Makefile
 create mode 100644 src/backend/access/undo/undolog.c
 create mode 100644 src/include/access/undolog.h
 create mode 100644 src/include/access/undolog_xlog.h

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index bd93a6a..7f7380c 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  tablesample transam
+			  tablesample transam undo
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..91ad1ef 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,6 @@ include $(top_builddir)/src/Makefile.global
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
 	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o undologdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 0000000..6cf32f4
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,88 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ *	  rmgr descriptor routines for access/undo/undolog.c
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_UNDOLOG_CREATE)
+	{
+		xl_undolog_create *xlrec = (xl_undolog_create *) rec;
+
+		appendStringInfo(buf, "logno %u", xlrec->logno);
+	}
+	else if (info == XLOG_UNDOLOG_EXTEND)
+	{
+		xl_undolog_extend *xlrec = (xl_undolog_extend *) rec;
+
+		appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_ATTACH)
+	{
+		xl_undolog_attach *xlrec = (xl_undolog_attach *) rec;
+
+		appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid);
+	}
+	else if (info == XLOG_UNDOLOG_DISCARD)
+	{
+		xl_undolog_discard *xlrec = (xl_undolog_discard *) rec;
+
+		appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end "
+						 UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->discard, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_REWIND)
+	{
+		xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec;
+
+		appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d",
+						 xlrec->logno, xlrec->insert, xlrec->prevlen);
+	}
+
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			id = "CREATE";
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			id = "EXTEND";
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			id = "ATTACH";
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			id = "DISCARD";
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			id = "REWIND";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b05374 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -18,6 +18,7 @@
 #include "access/multixact.h"
 #include "access/nbtxlog.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14e..1064ee0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -6693,6 +6694,9 @@ StartupXLOG(void)
 	 */
 	restoreTwoPhaseData();
 
+	/* Recover undo log meta data corresponding to this checkpoint. */
+	StartupUndoLogs(ControlFile->checkPointCopy.redo);
+
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
@@ -7315,7 +7319,13 @@ StartupXLOG(void)
 	 * end-of-recovery steps fail.
 	 */
 	if (InRecovery)
+	{
 		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+		ResetUndoLogs(UNDO_UNLOGGED);
+	}
+
+	/* Always reset temporary undo logs. */
+	ResetUndoLogs(UNDO_TEMP);
 
 	/*
 	 * We don't need the latch anymore. It's not strictly necessary to disown
@@ -9020,6 +9030,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
+	CheckPointUndoLogs(checkPointRedo, ControlFile->checkPointCopy.redo);
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
@@ -9726,6 +9737,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/*
 		 * We should've already switched to the new TLI before replaying this
 		 * record.
@@ -9785,6 +9799,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/* TLI should not change in an on-line checkpoint */
 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
 			ereport(PANIC,
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
new file mode 100644
index 0000000..219c696
--- /dev/null
+++ b/src/backend/access/undo/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/undo
+#
+# IDENTIFICATION
+#    src/backend/access/undo/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/undo
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = undolog.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
new file mode 100644
index 0000000..48dd662
--- /dev/null
+++ b/src/backend/access/undo/undolog.c
@@ -0,0 +1,2643 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ *	  management of undo logs
+ *
+ * PostgreSQL undo log manager.  This module is responsible for managing the
+ * lifecycle of undo logs and their segment files, associating undo logs with
+ * backends, and allocating space within undo logs.
+ *
+ * For the code that reads and writes blocks of data, see undofile.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogreader.h"
+#include "catalog/catalog.h"
+#include "catalog/pg_tablespace.h"
+#include "commands/tablespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "pgstat.h"
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/standby.h"
+#include "storage/undofile.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/varlena.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+/*
+ * During recovery we maintain a mapping of transaction ID to undo logs
+ * numbers.  We do this with a two-level array, so that we use memory only for
+ * chunks of the array that overlap with the range of active xids.
+ */
+#define UndoLogXidLowBits 16
+
+/*
+ * Number of high bits.
+ */
+#define UndoLogXidHighBits \
+	(sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits)
+
+/* Extract the upper bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits)
+
+/* Extract the lower bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1))
+
+/*
+ * Main control structure for undo log management in shared memory.
+ * UndoLogControl objects are arranged in a fixed-size array, at a position
+ * determined by the undo log number.
+ */
+typedef struct UndoLogSharedData
+{
+	UndoLogNumber free_lists[UndoPersistenceLevels];
+	UndoLogNumber low_logno; /* the lowest logno */
+	UndoLogNumber next_logno; /* one past the highest logno */
+	UndoLogNumber array_size; /* how many UndoLogControl objects do we have? */
+	UndoLogControl logs[FLEXIBLE_ARRAY_MEMBER];
+} UndoLogSharedData;
+
+/*
+ * Per-backend state for the undo log module.
+ * Backend-local pointers to undo subsystem state in shared memory.
+ */
+typedef struct UndoLogSession
+{
+	UndoLogSharedData *shared;
+
+	/*
+	 * The control object for the undo logs that this session is currently
+	 * attached to at each persistence level.  This is where it will write new
+	 * undo data.
+	 */
+	UndoLogControl *logs[UndoPersistenceLevels];
+
+	/*
+	 * If the undo_tablespaces GUC changes we'll remember to examine it and
+	 * attach to a new undo log using this flag.
+	 */
+	bool			need_to_choose_tablespace;
+
+	/*
+	 * During recovery, the startup process maintains a mapping of xid to undo
+	 * log number, instead of using 'log' above.  This is not used in regular
+	 * backends and can be in backend-private memory so long as recovery is
+	 * single-process.  This map references UNDO_PERMANENT logs only, since
+	 * temporary and unlogged relations don't have WAL to replay.
+	 */
+	UndoLogNumber **xid_map;
+
+	/*
+	 * The slot for the oldest xids still running.  We advance this during
+	 * checkpoints to free up chunks of the map.
+	 */
+	uint16			xid_map_oldest_chunk;
+
+	/* Current dbid.  Used during recovery. */
+	Oid				dbid;
+} UndoLogSession;
+
+UndoLogSession MyUndoLogState;
+
+undologtable_hash *undologtable_cache;
+
+/* GUC variables */
+char	   *undo_tablespaces = NULL;
+
+static UndoLogControl *get_undo_log(UndoLogNumber logno, bool locked);
+static UndoLogControl *allocate_undo_log(void);
+static void free_undo_log(UndoLogControl *log);
+static void attach_undo_log(UndoPersistence level, Oid tablespace);
+static void detach_current_undo_log(UndoPersistence level, bool full);
+static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end);
+static void undo_log_before_exit(int code, Datum value);
+static void forget_undo_buffers(int logno, UndoLogOffset old_discard,
+								UndoLogOffset new_discard,
+								bool drop_tail);
+static bool choose_undo_tablespace(bool force_detach, Oid *oid);
+static void undolog_xid_map_gc(void);
+
+PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs);
+
+/*
+ * How many undo logs can be active at a time?  This creates a theoretical
+ * maximum transaction size, but if we set it to a factor of the maximum number
+ * of backends it will be a very high limit.  Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+	return MaxBackends * 4;
+}
+
+/*
+ * Return the amount of traditional shmem required for undo log management.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)
+{
+	return sizeof(UndoLogSharedData) +
+		UndoLogNumSlots() * sizeof(UndoLogControl);
+}
+
+/*
+ * Initialize the undo log subsystem.  Called in each backend.
+ */
+void
+UndoLogShmemInit(void)
+{
+	bool found;
+
+	MyUndoLogState.shared = (UndoLogSharedData *)
+		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
+
+	/* The postmaster initialized the shared memory state. */
+	if (!IsUnderPostmaster)
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		Assert(!found);
+
+		/*
+		 * We start with no active undo logs.  StartUpUndoLogs() will recreate
+		 * the undo logs that were known at the last checkpoint.
+		 */
+		memset(shared, 0, sizeof(*shared));
+		shared->array_size = UndoLogNumSlots();
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+			shared->free_lists[i] = InvalidUndoLogNumber;
+		for (i = 0; i < shared->array_size; ++i)
+		{
+			memset(&shared->logs[i], 0, sizeof(shared->logs[i]));
+			shared->logs[i].logno = InvalidUndoLogNumber;
+			LWLockInitialize(&shared->logs[i].mutex,
+							 LWTRANCHE_UNDOLOG);
+			LWLockInitialize(&shared->logs[i].discard_lock,
+							 LWTRANCHE_UNDODISCARD);
+		}
+	}
+	else
+		Assert(found);
+
+	/* All backends prepare their per-backend lookup table. */
+	undologtable_cache = undologtable_create(TopMemoryContext,
+											 UndoLogNumSlots(),
+											 NULL);
+}
+
+void
+UndoLogInit(void)
+{
+	before_shmem_exit(undo_log_before_exit, 0);
+}
+
+/*
+ * Figure out which directory holds an undo log based on tablespace.
+ */
+static void
+UndoLogDirectory(Oid tablespace, char *dir)
+{
+	if (tablespace == DEFAULTTABLESPACE_OID ||
+		tablespace == InvalidOid)
+		snprintf(dir, MAXPGPATH, "base/undo");
+	else
+		snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo",
+				 tablespace, TABLESPACE_VERSION_DIRECTORY);
+}
+
+/*
+ * Compute the pathname to use for an undo log segment file.
+ */
+void
+UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path)
+{
+	char		dir[MAXPGPATH];
+
+	/* Figure out which directory holds the segment, based on tablespace. */
+	UndoLogDirectory(tablespace, dir);
+
+	/*
+	 * Build the path from log number and offset.  The pathname is the
+	 * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+	 * period inserted between the components.
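+	 * For example, with 1MB (0x100000-byte) segments, log number 1, segment 1
+	 * in the default tablespace comes out as "base/undo/000001.0000100000".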
+	 */
+	snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+			 segno * UndoLogSegmentSize);
+}
+
+/*
+ * Iterate through the set of currently active logs.  Pass in NULL to get the
+ * first undo log.  NULL indicates the end of the set of logs.  The caller
+ * must lock the returned log before accessing its members, and must skip if
+ * logno is not valid.
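+ *
+ * Typical usage (see DropUndoLogsInTablespace() below):
+ *
+ *     for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+ *         ... inspect log, taking log->mutex as required ...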
+ */
+UndoLogControl *
+UndoLogNext(UndoLogControl *log)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+	for (;;)
+	{
+		/* Advance to the next log. */
+		if (log == NULL)
+		{
+			/* Start at the beginning. */
+			log = &shared->logs[0];
+		}
+		else if (++log == &shared->logs[shared->array_size])
+		{
+			/* Past the end. */
+			log = NULL;
+			break;
+		}
+		/* Have we found a slot with a valid log? */
+		if (log->logno != InvalidUndoLogNumber)
+			break;
+	}
+	LWLockRelease(UndoLogLock);
+
+	/* XXX: erm, which lock should the caller hold!? */
+	return log;
+}
+
+/*
+ * Check if an undo log position has been discarded.  'point' must be an undo
+ * log pointer that was allocated at some point in the past, otherwise the
+ * result is undefined.
+ */
+bool
+UndoLogIsDiscarded(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log;
+	bool	result;
+
+	log = get_undo_log(logno, false);
+
+	/*
+	 * If we couldn't find the undo log number, then it must be entirely
+	 * discarded.
+	 */
+	if (log == NULL)
+		return true;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (unlikely(logno != log->logno))
+	{
+		/*
+		 * The undo log has been entirely discarded since we looked it up, and
+		 * the UndoLogControl slot is now unused or being used for some other
+		 * undo log.  That means that any pointer within it must be discarded.
+		 */
+		result = true;
+	}
+	else
+	{
+		/* Check if this point is before the discard pointer. */
+		result = UndoRecPtrGetOffset(point) < log->meta.discard;
+	}
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Store the latest transaction's start undo record pointer in the undo
+ * meta-data.  It will be fetched by the backend when it reuses the undo log
+ * and prepares its first undo record.
+ */
+void
+UndoLogSetLastXactStartPoint(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO: review */
+	log->meta.last_xact_start = UndoRecPtrGetOffset(point);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Fetch the previous transaction's start undo record point.
+ */
+UndoRecPtr
+UndoLogGetLastXactStartPoint(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint64 last_xact_start = 0;
+
+	if (unlikely(log == NULL))
+		return InvalidUndoRecPtr;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO: review */
+	last_xact_start = log->meta.last_xact_start;
+	LWLockRelease(&log->mutex);
+
+	if (last_xact_start == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, last_xact_start);
+}
+
+/*
+ * Store the last undo record's length in the undo meta-data so that it
+ * persists across restarts.
+ */
+void
+UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO review */
+	log->meta.prevlen = prevlen;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get the last undo record's length.
+ */
+uint16
+UndoLogGetPrevLen(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint16	prevlen;
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO review */
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	return prevlen;
+}
+
+/*
+ * Is this the first undo record for the given transaction?
+ */
+bool
+IsTransactionFirstRec(TransactionId xid)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	Assert(InRecovery);
+
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/* TODO review */
+	return log->meta.is_first_rec;
+}
+
+/*
+ * Detach from the undo log we are currently attached to, returning it to the
+ * appropriate free list if it still has space.
+ */
+static void
+detach_current_undo_log(UndoPersistence persistence, bool full)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+
+	MyUndoLogState.logs[persistence] = NULL;
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = InvalidPid;
+	log->xid = InvalidTransactionId;
+	if (full)
+		log->meta.status = UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	/* Push back onto the appropriate free list, unless it's full. */
+	if (!full)
+	{
+		LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+		log->next_free = shared->free_lists[persistence];
+		shared->free_lists[persistence] = log->logno;
+		LWLockRelease(UndoLogLock);
+	}
+}
+
+/*
+ * Exit handler, detaching from all undo logs.
+ */
+static void
+undo_log_before_exit(int code, Datum arg)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		if (MyUndoLogState.logs[i] != NULL)
+			detach_current_undo_log(i, false);
+	}
+}
+
+/*
+ * Create a new empty segment file on disk for the bytes starting at 'end'.
+ */
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+							UndoLogOffset end)
+{
+	struct stat	stat_buffer;
+	off_t	size;
+	char	path[MAXPGPATH];
+	void   *zeroes;
+	size_t	nzeroes = 8192;
+	int		fd;
+
+	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
+
+	/*
+	 * Create and fully allocate a new file.  If we crashed and recovered
+	 * then the file might already exist, so use flags that tolerate that.
+	 * It's also possible that it exists but is too short, in which case
+	 * we'll write the rest.  We don't really care what's in the file, we
+	 * just want to make sure that the filesystem has allocated physical
+	 * blocks for it, so that non-COW filesystems will report ENOSPC now
+	 * rather than later when the space is needed and we'll avoid creating
+	 * files with holes.
+	 */
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0 && tablespace != 0)
+	{
+		char undo_path[MAXPGPATH];
+
+		/* Try creating the undo directory for this tablespace. */
+		UndoLogDirectory(tablespace, undo_path);
+		if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+		{
+			char	   *parentdir;
+
+			if (errno != ENOENT || !InRecovery)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+
+			/*
+			 * In recovery, it's possible that the tablespace directory
+			 * doesn't exist because a later WAL record removed the whole
+			 * tablespace.  In that case we create a regular directory to
+			 * stand in for it.  This is similar to the logic in
+			 * TablespaceCreateDbspace().
+			 */
+
			/* Create the directory two levels up, if it doesn't already exist. */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
			/* Create the directory one level up, if it doesn't already exist. */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+		}
+
+		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	}
+	if (fd < 0)
+		elog(ERROR, "could not create new file \"%s\": %m", path);
+	if (fstat(fd, &stat_buffer) < 0)
+		elog(ERROR, "could not stat \"%s\": %m", path);
+	size = stat_buffer.st_size;
+
+	/* A buffer full of zeroes we'll use to fill up new segment files. */
+	zeroes = palloc0(nzeroes);
+
+	while (size < UndoLogSegmentSize)
+	{
+		ssize_t written;
+
+		written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size));
+		if (written < 0)
+			elog(ERROR, "cannot initialize undo log segment file \"%s\": %m",
+				 path);
+		size += written;
+	}
+
+	/* Flush the contents of the file to disk. */
+	if (pg_fsync(fd) != 0)
+		elog(ERROR, "cannot fsync file \"%s\": %m", path);
+	CloseTransientFile(fd);
+
+	pfree(zeroes);
+
+	elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */
+}
+
+/*
+ * Create a new undo segment when it is unexpectedly not present.
+ */
+void
+UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno)
+{
+	Assert(InRecovery);
+	allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize);
+}
+
+/*
+ * Create and zero-fill new segments as required to extend a given undo log
+ * up to 'new_end'.
+ */
+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+	UndoLogControl *log;
+	char		dir[MAXPGPATH];
+	size_t		end;
+
+	log = get_undo_log(logno, false);
+
+	/* TODO review interlocking */
+
+	Assert(log != NULL);
+	Assert(log->meta.end % UndoLogSegmentSize == 0);
+	Assert(new_end % UndoLogSegmentSize == 0);
+	Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery);
+
+	/*
+	 * Create all the segments needed to increase 'end' to the requested
+	 * size.  This is quite expensive, so we will try to avoid it completely
+	 * by renaming files into place in UndoLogDiscard instead.
+	 */
+	end = log->meta.end;
+	while (end < new_end)
+	{
+		allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	Assert(end == new_end);
+
+	/*
+	 * Flush the parent dir so that the directory metadata survives a crash
+	 * after this point.
+	 */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/*
+	 * If we're not in recovery, we need to WAL-log the creation of the new
+	 * file(s).  We do that after the above filesystem modifications, in
+	 * violation of the data-before-WAL rule as exempted by
+	 * src/backend/access/transam/README.  This means that it's possible for
+	 * us to crash having made some or all of the filesystem changes but
+	 * before WAL logging, but in that case we'll eventually try to create the
+	 * same segment(s) again, which is tolerated.
+	 */
+	if (!InRecovery)
+	{
+		xl_undolog_extend xlrec;
+		XLogRecPtr	ptr;
+
+		xlrec.logno = logno;
+		xlrec.end = end;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+		XLogFlush(ptr);
+	}
+
+	/*
+	 * We didn't need to acquire the mutex to read 'end' above because only
+	 * we write to it.  But we need the mutex to update it, because the
+	 * checkpointer might read it concurrently.
+	 *
+	 * XXX It's possible for meta.end to be higher already during
+	 * recovery, because of the timing of a checkpoint; in that case we did
+	 * nothing above and we shouldn't update shmem here.  That interaction
+	 * needs more analysis.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (log->meta.end < end)
+		log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get an insertion point that is guaranteed to be backed by enough space to
+ * hold 'size' bytes of data.  To actually write into the undo log, client
+ * code should call this first and then use bufmgr routines to access buffers
+ * and provide WAL logs and redo handlers.  In other words, while this module
+ * looks after making sure the undo log has sufficient space and the undo meta
+ * data is crash safe, the *contents* of the undo log and (indirectly) the
+ * insertion point are the responsibility of client code.
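+ *
+ * A rough sketch of a caller (the buffer and record handling in the middle
+ * is illustrative only and not part of this module's API):
+ *
+ *     UndoRecPtr ptr = UndoLogAllocate(rec_size, UNDO_PERMANENT);
+ *     ... pin/lock the buffers covering ptr, copy in the record, WAL-log ...
+ *     UndoLogAdvance(ptr, rec_size, UNDO_PERMANENT);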
+ *
+ * Return an undo log insertion point that can be converted to a buffer tag
+ * and an insertion point within a buffer page.
+ *
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+	UndoLogOffset new_insert;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	TransactionId logxid;
+
+	/*
+	 * We may need to attach to an undo log, either because this is the first
	 * time this backend has needed to write to an undo log at all or because
+	 * the undo_tablespaces GUC was changed.  When doing that, we'll need
+	 * interlocking against tablespaces being concurrently dropped.
+	 */
+
+ retry:
+	/* See if we need to check the undo_tablespaces GUC. */
+	if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL))
+	{
+		Oid		tablespace;
+		bool	need_to_unlock;
+
+		need_to_unlock =
+			choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace,
+								   &tablespace);
+		attach_undo_log(persistence, tablespace);
+		if (need_to_unlock)
+			LWLockRelease(TablespaceCreateLock);
+		log = MyUndoLogState.logs[persistence];
+		log->meta.prevlogno = prevlogno;
+		MyUndoLogState.need_to_choose_tablespace = false;
+	}
+
+	/*
+	 * If this is the first time we've allocated undo log space in this
+	 * transaction, we'll record the xid->undo log association so that it can
+	 * be replayed correctly. Before that, we set the first record flag to
+	 * false.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.is_first_rec = false;
+	logxid = log->xid;
+
+	if (logxid != GetTopTransactionId())
+	{
+		xl_undolog_attach xlrec;
+
+		/*
+		 * While we have the lock, check if we have been forcibly detached by
+		 * DROP TABLESPACE.  That can only happen between transactions (see
		 * DropUndoLogsInTablespace()).
+		 */
+		if (log->pid == InvalidPid)
+		{
+			LWLockRelease(&log->mutex);
+			log = NULL;
+			goto retry;
+		}
+		log->xid = GetTopTransactionId();
+		log->meta.is_first_rec = true;
+		LWLockRelease(&log->mutex);
+
+		/* Skip the attach record for unlogged and temporary tables. */
+		if (persistence == UNDO_PERMANENT)
+		{
+			xlrec.xid = GetTopTransactionId();
+			xlrec.logno = log->logno;
+			xlrec.dbid = MyDatabaseId;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH);
+		}
+	}
+	else
+	{
+		LWLockRelease(&log->mutex);
+	}
+
+	/*
+	 * 'size' is expressed in usable non-header bytes.  Figure out how far we
+	 * have to move insert to create space for 'size' usable bytes, stepping
+	 * over any intervening headers.
+	 */
+	Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize);
+	new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize);
+
+	/*
+	 * We don't need to acquire log->mutex to read log->meta.insert and
+	 * log->meta.end, because this backend is the only one that can
+	 * modify them.
+	 */
+	if (unlikely(new_insert > log->meta.end))
+	{
+		if (new_insert > UndoLogMaxSize)
+		{
+			/* This undo log is entirely full.  Get a new one. */
+			if (logxid == GetTopTransactionId())
+			{
+				/*
+				 * If the same transaction is split over two undo logs then
+				 * store the previous log number in new log.  See detailed
+				 * comments in undorecord.c file header.
+				 */
+				prevlogno = log->logno;
+			}
+			elog(LOG, "undo log %u is full, switching to a new one", log->logno);
+			log = NULL;
+			detach_current_undo_log(persistence, true);
+			goto retry;
+		}
+		/*
+		 * Extend the end of this undo log to cover new_insert (in other words
+		 * round up to the segment size).
+		 */
+		extend_undo_log(log->logno,
+						new_insert + UndoLogSegmentSize -
+						new_insert % UndoLogSegmentSize);
+		Assert(new_insert <= log->meta.end);
+	}
+
+	return MakeUndoRecPtr(log->logno, log->meta.insert);
+}
+
+/*
+ * In recovery, we expect the xid to map to a known log which already has
+ * enough space in it.
+ */
+UndoRecPtr
+UndoLogAllocateInRecovery(TransactionId xid, size_t size,
+						  UndoPersistence level)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	/*
	 * The sequence of calls to UndoLogAllocateInRecovery() during REDO
	 * (recovery) must match the sequence of calls to UndoLogAllocate() during
	 * DO, for any given session.  The XXX_redo code for any UNDO-generating
	 * operation must use UndoLogAllocateInRecovery() rather than
+	 * UndoLogAllocate(), because it must supply the extra 'xid' argument so
+	 * that we can find out which undo log number to use.  During DO, that's
+	 * tracked per-backend, but during REDO the original backends/sessions are
+	 * lost and we have only the Xids.
+	 */
+	Assert(InRecovery);
+
+	/*
+	 * Look up the undo log number for this xid.  The mapping must already
+	 * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the
+	 * first call to UndoLogAllocate for this xid after the most recent
+	 * checkpoint.
+	 */
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	if (logno == InvalidUndoLogNumber)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	/*
+	 * This log must already have been created by an XLOG_UNDOLOG_CREATE
+	 * record emitted by UndoLogAllocate().
+	 */
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/*
+	 * This log must already have been extended to cover the requested size by
+	 * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate(), or by
	 * XLOG_UNDOLOG_DISCARD records recycling segments.
+	 */
+	if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size))
+		elog(ERROR,
+			 "unexpectedly couldn't allocate %zu bytes in undo log number %d",
+			 size, logno);
+
+	/*
	 * By this time we have allocated undo log space for this transaction, so
	 * the next record will not be the first undo record for the transaction.
+	 */
+	log->meta.is_first_rec = false;
+
+	return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = NULL;
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insertion_point) ;
+
+	/*
	 * During recovery this backend isn't attached to any undo log, so we
	 * have to look the log up by number instead.
+	 */
+	log = (InRecovery) ? get_undo_log(logno, false)
+		: MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+	Assert(InRecovery || logno == log->logno);
+	Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions.  The passed in undo pointer is
+ * the address of the oldest data that the caller would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_pointer).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment.  This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk.  Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted.  It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
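+ *
+ * For example, an undo worker that has determined that everything before
+ * 'oldest_keep' (an illustrative UndoRecPtr) in some undo log is no longer
+ * needed might call:
+ *
+ *     UndoLogDiscard(oldest_keep, latest_xid_in_discarded_range);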
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+	UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+	UndoLogOffset old_discard;
+	UndoLogOffset end;
+	UndoLogControl *log;
+	int			segno;
+	int			new_segno;
+	bool		need_to_flush_wal = false;
+	bool		entirely_discarded = false;
+
+	log = get_undo_log(logno, false);
+	if (unlikely(log == NULL))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is already entirely discarded",
+			 logno);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (unlikely(log->logno != logno))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is entirely discarded",
+			 logno);
+	if (discard > log->meta.insert)
+		elog(ERROR, "cannot move discard point past insert point");
+	old_discard = log->meta.discard;
+	if (discard < old_discard)
+		elog(ERROR, "cannot move discard pointer backwards");
+	end = log->meta.end;
+	/* Are we discarding the last remaining data in a log marked as full? */
+	if (log->meta.status == UNDO_LOG_STATUS_FULL &&
+		discard == log->meta.insert)
+	{
+		/*
+		 * Adjust the discard and insert pointers so that the final segment is
+		 * deleted from disk, and remember not to recycle it.
+		 */
+		entirely_discarded = true;
+		log->meta.insert = log->meta.end;
+		discard = log->meta.end;
+	}
+	LWLockRelease(&log->mutex);
+
+	/*
+	 * Drop all buffers holding this undo data out of the buffer pool (except
+	 * the last one, if the new location is in the middle of it somewhere), so
+	 * that the contained data doesn't ever touch the disk.  The caller
+	 * promises that this data will not be needed again.  We have to drop the
+	 * buffers from the buffer pool before removing files, otherwise a
+	 * concurrent session might try to write the block to evict the buffer.
+	 */
+	forget_undo_buffers(logno, old_discard, discard, entirely_discarded);
+
+	/*
+	 * Check if we crossed a segment boundary and need to do some synchronous
+	 * filesystem operations.
+	 */
+	segno = old_discard / UndoLogSegmentSize;
+	new_segno = discard / UndoLogSegmentSize;
+	if (segno < new_segno)
+	{
+		int		recycle;
+		UndoLogOffset pointer;
+
+		/*
+		 * We always WAL-log discards, but we only need to flush the WAL if we
+		 * have performed a filesystem operation.
+		 */
+		need_to_flush_wal = true;
+
+		/*
+		 * XXX When we rename or unlink a file, it's possible that some
+		 * backend still has it open because it has recently read a page from
+		 * it.  smgr/undofile.c in any such backend will eventually close it,
+		 * because it considers that fd to belong to the file with the name
+		 * that we're unlinking or renaming and it doesn't like to keep more
+		 * than one open at a time.  No backend should ever try to read from
+		 * such a file descriptor; that is what it means when we say that the
+		 * caller of UndoLogDiscard() asserts that there will be no attempts
+		 * to access the discarded range of undo log.  In the case of a
+		 * rename, if a backend were to attempt to read undo data in the range
+		 * being discarded, it would read entirely the wrong data.
+		 */
+
+		/*
+		 * How many segments should we recycle (= rename from tail position to
+		 * head position)?  For now it's always 1 unless there is already a
+		 * spare one, but we could have an adaptive algorithm that recycles
+		 * multiple segments at a time and pays just one fsync().
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+		if ((log->meta.end - log->meta.insert) < UndoLogSegmentSize &&
+			log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+			recycle = 1;
+		else
+			recycle = 0;
+		LWLockRelease(&log->mutex);
+
+		/* Rewind to the start of the segment. */
+		pointer = segno * UndoLogSegmentSize;
+
+		while (pointer < new_segno * UndoLogSegmentSize)
+		{
+			char	discard_path[MAXPGPATH];
+
+			/*
+			 * Before removing the file, make sure that undofile_sync knows
+			 * that it might be missing.
+			 */
+			undofile_forgetsync(log->logno,
+								log->meta.tablespace,
+								pointer / UndoLogSegmentSize);
+
+			UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize,
+							   log->meta.tablespace, discard_path);
+
+			/* Can we recycle the oldest segment? */
+			if (recycle > 0)
+			{
+				char	recycle_path[MAXPGPATH];
+
+				/*
+				 * End points one byte past the end of the current undo space,
+				 * ie to the first byte of the segment file we want to create.
+				 */
+				UndoLogSegmentPath(logno, end / UndoLogSegmentSize,
+								   log->meta.tablespace, recycle_path);
+				if (rename(discard_path, recycle_path) == 0)
+				{
+					elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+					end += UndoLogSegmentSize;
+					--recycle;
+				}
+				else
+				{
+					elog(ERROR, "could not rename \"%s\" to \"%s\": %m",
+						 discard_path, recycle_path);
+				}
+			}
+			else
+			{
+				if (unlink(discard_path) == 0)
+					elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+				else
+					elog(ERROR, "could not unlink \"%s\": %m", discard_path);
+			}
+			pointer += UndoLogSegmentSize;
+		}
+	}
+
+	/* WAL log the discard. */
+	{
+		xl_undolog_discard xlrec;
+		XLogRecPtr ptr;
+
+		xlrec.logno = logno;
+		xlrec.discard = discard;
+		xlrec.end = end;
+		xlrec.latestxid = xid;
+		xlrec.entirely_discarded = entirely_discarded;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
+
+		if (need_to_flush_wal)
+			XLogFlush(ptr);
+	}
+
+	/* Update shmem to show the new discard and end pointers. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * Return an UndoRecPtr to the oldest valid data in an undo log, or
+ * InvalidUndoRecPtr if it is empty.
+ */
+UndoRecPtr
+UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full)
+{
+	UndoRecPtr	result;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (log->meta.discard == log->meta.insert)
+		result = InvalidUndoRecPtr;
+	else
+		result = MakeUndoRecPtr(log->logno, log->meta.discard);
+	*full = log->meta.status == UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Return the next insert location.  This also validates the input xid: if
+ * the latest insert point is not for the same transaction id, this returns
+ * an invalid undo pointer.
+ */
+UndoRecPtr
+UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId	logxid;
+	UndoRecPtr	insert;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert);
+}
+
+/*
+ * Get the address of the most recently inserted record.
+ */
+UndoRecPtr
+UndoLogGetLastRecordPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId logxid;
+	UndoRecPtr insert;
+	uint16 prevlen;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) &&
+		TransactionIdIsValid(xid) &&
+		!TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	if (prevlen == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert - prevlen);
+}
+
+/*
+ * Rewind the undo log insert position and also set prevlen in the meta-data.
+ */
+void
+UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen)
+{
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insert_urp);
+	UndoLogControl *log = get_undo_log(logno, false);
+	UndoLogOffset	insert = UndoRecPtrGetOffset(insert_urp);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = insert;
+	log->meta.prevlen = prevlen;
+
+	/*
	 * Force a WAL record to be written on the next undo allocation, so that
	 * during recovery the undo insert location is consistent with normal
	 * allocation.
+	 */
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	/* WAL log the rewind. */
+	{
+		xl_undolog_rewind xlrec;
+
+		xlrec.logno = logno;
+		xlrec.insert = insert;
+		xlrec.prevlen = prevlen;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND);
+	}
+}
+
+/*
+ * Delete unreachable files under pg_undo.  Any files corresponding to LSN
+ * positions before the previous checkpoint are no longer needed.
+ */
+static void
+CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo)
+{
+	DIR	   *dir;
+	struct dirent *de;
+	char	path[MAXPGPATH];
+	char	oldest_path[MAXPGPATH];
+
+	/*
+	 * If a base backup is in progress, we can't delete any checkpoint
+	 * snapshot files because one of them corresponds to the backup label but
+	 * there could be any number of checkpoints during the backup.
+	 */
+	if (BackupInProgress())
+		return;
+
+	/* Otherwise keep only those >= the previous checkpoint's redo point. */
+	snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	dir = AllocateDir("pg_undo");
+	while ((de = ReadDir(dir, "pg_undo")) != NULL)
+	{
+		/*
+		 * Assume that fixed width uppercase hex strings sort the same way as
+		 * the values they represent, so we can use strcmp to identify undo
+		 * log snapshot files corresponding to checkpoints that we don't need
+		 * anymore.  This assumption holds for ASCII.
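+		 *
+		 * For example, a snapshot written for checkpoint redo LSN 0/16B3B40
+		 * is named "00000000016B3B40", and any file whose name sorts before
+		 * the previous checkpoint's redo LSN can be removed.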
+		 */
+		if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH))
+			continue;
+
+		if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path))
+		{
+			snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name);
+			if (unlink(path) != 0)
+				elog(ERROR, "could not unlink file \"%s\": %m", path);
+		}
+	}
+	FreeDir(dir);
+}
+
+/*
+ * Write out the undo log meta data to the pg_undo directory.  The actual
+ * contents of undo logs are in shared buffers and therefore handled by
+ * CheckPointBuffers(), but here we record the table of undo logs and their
+ * properties.
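+ *
+ * The snapshot file written here has a simple layout (see also
+ * StartupUndoLogs(), which reads it back):
+ *
+ *     low_logno | next_logno | num_logs | CRC32C of the preceding fields
+ *     num_logs x UndoLogMetaData        | CRC32C of the meta-data array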
+ */
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogMetaData *serialized = NULL;
+	size_t	serialized_size = 0;
+	char   *data;
+	char	path[MAXPGPATH];
+	int		num_logs;
+	int		fd;
+	int		i;
+	pg_crc32c crc;
+
+	/*
+	 * We acquire UndoLogLock to prevent any undo logs from being created or
+	 * discarded while we build a snapshot of them.  This isn't expected to
+	 * take long on a healthy system because the number of active logs should
+	 * be around the number of backends.  Holding this lock won't prevent
+	 * concurrent access to the undo log, except when segments need to be
+	 * added or removed.
+	 */
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+
+	/*
+	 * Rather than doing the file IO while we hold locks, we'll copy the
+	 * meta-data into a palloc'd buffer.
+	 */
+	serialized_size = sizeof(UndoLogMetaData) * UndoLogNumSlots();
+	serialized = (UndoLogMetaData *) palloc0(serialized_size);
+
+	/* Scan through all slots looking for non-empty ones. */
+	num_logs = 0;
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		UndoLogControl *slot = &shared->logs[i];
+
+		/* Skip empty slots. */
+		if (slot->logno == InvalidUndoLogNumber)
+			continue;
+
+		/* Capture snapshot while holding each mutex. */
+		LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+		serialized[num_logs++] = slot->meta;
+		slot->need_attach_wal_record = true; /* XXX: ?!? */
+		LWLockRelease(&slot->mutex);
+	}
+
+	LWLockRelease(UndoLogLock);
+
+	/* Dump into a file under pg_undo. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", path)));
+
+	/* Compute header checksum. */
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+	FIN_CRC32C(crc);
+
+	/* Write out the number of active logs + crc. */
+	if ((write(fd, &shared->low_logno, sizeof(shared->low_logno)) != sizeof(shared->low_logno)) ||
+		(write(fd, &shared->next_logno, sizeof(shared->next_logno)) != sizeof(shared->next_logno)) ||
+		(write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
+		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+	/* Write out the meta data for all active undo logs. */
+	data = (char *) serialized;
+	INIT_CRC32C(crc);
+	serialized_size = num_logs * sizeof(UndoLogMetaData);
+	while (serialized_size > 0)
+	{
+		ssize_t written;
+
+		written = write(fd, data, serialized_size);
+		if (written < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m", path)));
+		COMP_CRC32C(crc, data, written);
+		serialized_size -= written;
+		data += written;
+	}
+	FIN_CRC32C(crc);
+
+	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+
	pgstat_report_wait_end();

	/* Flush file and directory entry. */
	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
+	pg_fsync(fd);
+	CloseTransientFile(fd);
+	fsync_fname("pg_undo", true);
+	pgstat_report_wait_end();
+
+	if (serialized)
+		pfree(serialized);
+
+	CleanUpUndoCheckPointFiles(priorCheckPointRedo);
+	undolog_xid_map_gc();
+}
+
+void
+StartupUndoLogs(XLogRecPtr checkPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char	path[MAXPGPATH];
+	int		i;
+	int		fd;
+	int		nlogs;
+	pg_crc32c crc;
+	pg_crc32c new_crc;
+
+	/* If initdb is calling, there is no file to read yet. */
+	if (IsBootstrapProcessingMode())
+		return;
+
+	/* Open the pg_undo file corresponding to the given checkpoint. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
+
+	/* Read the active log number range. */
+	if ((read(fd, &shared->low_logno, sizeof(shared->low_logno))
+		 != sizeof(shared->low_logno)) ||
+		(read(fd, &shared->next_logno, sizeof(shared->next_logno))
+		 != sizeof(shared->next_logno)) ||
+		(read(fd, &nlogs, sizeof(nlogs)) != sizeof(nlogs)) ||
+		(read(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+
+	/* Verify the header checksum. */
+	INIT_CRC32C(new_crc);
+	COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(new_crc, &shared->next_logno, sizeof(shared->next_logno));
	COMP_CRC32C(new_crc, &nlogs, sizeof(nlogs));
+	FIN_CRC32C(new_crc);
+
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	/*
+	 * We'll acquire UndoLogLock just because allocate_undo_log() asserts we
+	 * hold it (we don't actually expect concurrent access yet).
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/* Initialize all the logs and set up the freelist. */
+	INIT_CRC32C(new_crc);
+	for (i = 0; i < nlogs; ++i)
+	{
+		ssize_t size;
+		UndoLogControl *log;
+
+		/*
+		 * Get a new slot to hold this UndoLogControl object.  If this
+		 * checkpoint was created on a system with a higher max_connections
+		 * setting, it's theoretically possible that we don't have enough
+		 * space and cannot start up.
+		 */
+		log = allocate_undo_log();
+		if (!log)
+			ereport(ERROR,
+					(errmsg("not enough undo log slots to recover from checkpoint: need at least %d, have %zu",
+							nlogs, UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections")));
+
+		/* Read in the meta data for this undo log. */
+		if ((size = read(fd, &log->meta, sizeof(log->meta))) != sizeof(log->meta))
+			elog(ERROR, "short read of pg_undo meta data in file \"%s\": %m (got %zu, wanted %zu)",
+				 path, size, sizeof(log->meta));
+		COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta));
+
+		/*
+		 * At normal start-up, or during recovery, all active undo logs start
+		 * out on the appropriate free list.
+		 */
+		log->logno = log->meta.logno;
+		log->pid = InvalidPid;
+		log->xid = InvalidTransactionId;
+		if (log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+		{
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+		}
+	}
+	FIN_CRC32C(new_crc);
+
+	LWLockRelease(UndoLogLock);
+
+	/* Verify body checksum. */
+	if (read(fd, &crc, sizeof(crc)) != sizeof(crc))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	CloseTransientFile(fd);
+	pgstat_report_wait_end();
+}
+
+/*
+ * Return a pointer to a newly allocated UndoLogControl object in shared
+ * memory, or return NULL if there are no free slots.  The caller should
+ * acquire the mutex and set up the object.
+ */
+static UndoLogControl *
+allocate_undo_log(void)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMeInMode(UndoLogLock, LW_EXCLUSIVE));
+
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		log = &shared->logs[i];
+		if (log->logno == InvalidUndoLogNumber)
+		{
+			memset(&log->meta, 0, sizeof(log->meta));
+			log->next_free = InvalidUndoLogNumber;
+			/* TODO: oldest_xid etc? */
+			return log;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an UndoLogControl object in shared memory, so that it can be reused.
+ */
+static void
+free_undo_log(UndoLogControl *log)
+{
+	/*
+	 * When removing an undo log from a slot in shared memory, we acquire
+	 * UndoLogLock and log->mutex, so that other code can hold either lock to
+	 * prevent the object from disappearing.
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno != InvalidUndoLogNumber);
+	log->logno = InvalidUndoLogNumber;
+	memset(&log->meta, 0, sizeof(log->meta));
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * Get the UndoLogControl object for a given log number.
+ *
+ * The caller may or may not already hold UndoLogLock, and should indicate
+ * this by passing 'locked'.  We'll acquire it in the slow path if necessary.
+ * Either way, the caller must deal with the possibility that the returned
+ * UndoLogControl object pointed to no longer contains the requested logno by
+ * the time it is accessed.
+ *
+ * To do that, one of the following approaches must be taken by the calling
+ * code:
+ *
+ * 1.  If it is known that the calling backend is attached to the log, then it
+ * can be assumed that the UndoLogControl slot still holds the same undo log
+ * number.  The UndoLogControl slot can only change with the cooperation of
+ * the undo log that is attached to it (it must first be marked as
+ * UNDO_LOG_STATUS_FULL, which happens when a backend detaches).  Calling
+ * code should probably assert that it is attached and the logno is as
+ * expected, however.
+ *
+ * 2.  Acquire log->mutex before accessing any members, and after doing so,
+ * check that the logno is as expected.  If it is not, the entire undo log
+ * must be assumed to be discarded and the caller must behave accordingly.
+ *
+ * Return NULL if the undo log has been entirely discarded.  It is an error to
+ * ask for undo logs that have never been created.
+ */
+static UndoLogControl *
+get_undo_log(UndoLogNumber logno, bool locked)
+{
+	UndoLogControl *result = NULL;
+	UndoLogTableEntry *entry;
+	bool	   found;
+
+	Assert(locked == LWLockHeldByMe(UndoLogLock));
+
+	/* First see if we already have it in our cache. */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		result = entry->control;
+	else
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		/* Nope.  Linear search for the slot in shared memory. */
+		if (!locked)
+			LWLockAcquire(UndoLogLock, LW_SHARED);
+		for (i = 0; i < UndoLogNumSlots(); ++i)
+		{
+			if (shared->logs[i].logno == logno)
+			{
+				/* Found it. */
+
+				/*
+				 * TODO: Should this function be usable in a critical section?
				 * Would it make sense to detect that we are in a critical
+				 * section and just return the pointer to the log without
+				 * updating the cache, to avoid any chance of allocating
+				 * memory?
+				 */
+
+				entry = undologtable_insert(undologtable_cache, logno, &found);
+				entry->number = logno;
+				entry->control = &shared->logs[i];
+				entry->tablespace = entry->control->meta.tablespace;
+				result = entry->control;
+				break;
+			}
+		}
+
+		/*
+		 * If we didn't find it, then it must already have been entirely
+		 * discarded.  We create a negative cache entry so that we can answer
+		 * this question quickly next time.
+		 *
+		 * TODO: We could track the lowest known undo log number, to reduce
+		 * the negative cache entry bloat.
+		 */
+		if (result == NULL)
+		{
+			/*
+			 * Sanity check: the caller should not be asking about undo logs
+			 * that have never existed.
+			 */
+			if (logno >= shared->next_logno)
+				elog(PANIC, "undo log %u hasn't been created yet", logno);
+			entry = undologtable_insert(undologtable_cache, logno, &found);
+			entry->number = logno;
+			entry->control = NULL;
+			entry->tablespace = 0;
+		}
+		if (!locked)
+			LWLockRelease(UndoLogLock);
+	}
+
+	return result;
+}
+
+/*
+ * Get a pointer to an UndoLogControl object corresponding to a given logno.
+ *
+ * In general, the caller must acquire the UndoLogControl's mutex to access
+ * the contents, and at that time must consider that the logno might have
+ * changed because the undo log it contained has been entirely discarded.
+ *
+ * If the calling backend is currently attached to the undo log, that is not
+ * possible, because logs can only reach UNDO_LOG_STATUS_DISCARDED after first
+ * reaching UNDO_LOG_STATUS_FULL, and that only happens while detaching.
+ */
+UndoLogControl *
+UndoLogGet(UndoLogNumber logno, bool missing_ok)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	if (log == NULL && !missing_ok)
+		elog(ERROR, "unknown undo log number %d", logno);
+
+	return log;
+}
+
+/*
+ * Attach to an undo log, possibly creating or recycling one as required.
+ */
+static void
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = NULL;
+	UndoLogNumber logno;
+	UndoLogNumber *place;
+
+	Assert(!InRecovery);
+	Assert(MyUndoLogState.logs[persistence] == NULL);
+
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/*
+	 * For now we have a simple linked list of unattached undo logs for each
	 * persistence level.  We'll grovel through it to find something for the
+	 * tablespace you asked for.  If you're not using multiple tablespaces
+	 * it'll be able to pop one off the front.  We might need a hash table
+	 * keyed by tablespace if this simple scheme turns out to be too slow when
+	 * using many tablespaces and many undo logs, but that seems like an
+	 * unusual use case not worth optimizing for.
+	 */
+	place = &shared->free_lists[persistence];
+	while (*place != InvalidUndoLogNumber)
+	{
+		UndoLogControl *candidate = get_undo_log(*place, true);
+
+		/*
+		 * There should never be an undo log on the freelist that has been
+		 * entirely discarded, or hasn't been created yet.  The persistence
+		 * level should match the freelist.
+		 */
+		if (unlikely(candidate == NULL))
+			elog(ERROR,
+				 "corrupted undo log freelist, no such undo log %u", *place);
+		if (unlikely(candidate->meta.persistence != persistence))
+			elog(ERROR,
+				 "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
+				 *place, candidate->meta.persistence, persistence);
+
+		if (candidate->meta.tablespace == tablespace)
+		{
+			logno = *place;
+			log = candidate;
+			*place = candidate->next_free;
+			break;
+		}
+		place = &candidate->next_free;
+	}
+
+	/*
+	 * All existing undo logs for this tablespace and persistence level are
+	 * busy, so we'll have to create a new one.
+	 */
+	if (log == NULL)
+	{
+		if (shared->next_logno > MaxUndoLogNumber)
+		{
+			/*
+			 * You've used up all 16 exabytes of undo log addressing space.
+			 * This is a difficult state to reach using only 16 exabytes of
+			 * WAL.
+			 */
+			elog(ERROR, "undo log address space exhausted");
+		}
+
+		/* Allocate a slot from the UndoLogControl pool. */
+		log = allocate_undo_log();
+		if (unlikely(!log))
+			ereport(ERROR,
+					(errmsg("could not create new undo log"),
+					 errdetail("The maximum number of active undo logs is %zu.",
+							   UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections.")));
+		log->logno = logno = shared->next_logno;
+
+		/*
+		 * The insert and discard pointers start after the first block's
+		 * header.  XXX That means that insert is > end for a short time in a
+		 * newly created undo log.  Is there any problem with that?
+		 */
+		log->meta.insert = UndoLogBlockHeaderSize;
+		log->meta.discard = UndoLogBlockHeaderSize;
+
+		log->meta.logno = logno;
+		log->meta.tablespace = tablespace;
+		log->meta.persistence = persistence;
+		log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+
+		/* Move the high log number pointer past this one. */
+		++shared->next_logno;
+
+		/* WAL-log the creation of this new undo log. */
+		{
+			xl_undolog_create xlrec;
+
+			xlrec.logno = logno;
+			xlrec.tablespace = log->meta.tablespace;
+			xlrec.persistence = log->meta.persistence;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+		}
+
+		/*
+		 * This undo log has no segments.  UndoLogAllocate will create the
+		 * first one on demand.
+		 */
+	}
+	LWLockRelease(UndoLogLock);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = MyProcPid;
+	log->xid = InvalidTransactionId;
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	MyUndoLogState.logs[persistence] = log;
+}
+
+/*
+ * Free chunks of the xid/undo log map that relate to transactions that are no
+ * longer running.  This is run at each checkpoint.
+ */
+static void
+undolog_xid_map_gc(void)
+{
+	UndoLogNumber **xid_map = MyUndoLogState.xid_map;
+	TransactionId oldest_xid;
+	uint16 new_oldest_chunk;
+	uint16 oldest_chunk;
+
+	if (xid_map == NULL)
+		return;
+
+	/*
+	 * During crash recovery, it may not be possible to call GetOldestXmin()
+	 * yet because latestCompletedXid is invalid.
+	 */
+	if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid))
+		return;
+
+	oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	new_oldest_chunk = UndoLogGetXidHigh(oldest_xid);
+	oldest_chunk = MyUndoLogState.xid_map_oldest_chunk;
+
+	while (oldest_chunk != new_oldest_chunk)
+	{
+		if (xid_map[oldest_chunk])
+		{
+			pfree(xid_map[oldest_chunk]);
+			xid_map[oldest_chunk] = NULL;
+		}
+		oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits);
+	}
+	MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk;
+}
+
+/*
+ * Associate an xid with an undo log during recovery.  In a primary server,
+ * this isn't necessary because backends know which undo log they're attached
+ * to.  During recovery, the natural association between backends and xids is
+ * lost, so we need to manage that explicitly.
+ */
+static void
+undolog_xid_map_add(TransactionId xid, UndoLogNumber logno)
+{
+	uint16		high_bits;
+	uint16		low_bits;
+
+	high_bits = UndoLogGetXidHigh(xid);
+	low_bits = UndoLogGetXidLow(xid);
+
+	if (unlikely(MyUndoLogState.xid_map == NULL))
+	{
+		/* First time through.  Create mapping array. */
+		MyUndoLogState.xid_map =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber *) *
+								   (1 << (32 - UndoLogXidLowBits)));
+		MyUndoLogState.xid_map_oldest_chunk = high_bits;
+	}
+
+	if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL))
+	{
+		/* This bank of mappings doesn't exist yet.  Create it. */
+		MyUndoLogState.xid_map[high_bits] =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber) *
+								   (1 << UndoLogXidLowBits));
+	}
+
+	/* Associate this xid with this undo log number. */
+	MyUndoLogState.xid_map[high_bits][low_bits] = logno;
+}
+
+/* check_hook: validate new undo_tablespaces */
+bool
+check_undo_tablespaces(char **newval, void **extra, GucSource source)
+{
+	char	   *rawname;
+	List	   *namelist;
+
+	/* Need a modifiable copy of string */
+	rawname = pstrdup(*newval);
+
+	/*
+	 * Parse string into list of identifiers, just to check for
	 * well-formedness (unfortunately we can't validate the names in the
+	 * catalog yet).
+	 */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+	{
+		/* syntax error in name list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawname);
+		list_free(namelist);
+		return false;
+	}
+
+	/*
+	 * Make sure we aren't already in a transaction that has been assigned an
+	 * XID.  This ensures we don't detach from an undo log that we might have
+	 * started writing undo data into for this transaction.
+	 */
+	if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+	list_free(namelist);
+
+	return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+	/*
+	 * This is normally called only when GetTopTransactionIdIfAny() ==
+	 * InvalidTransactionId (because you can't change undo_tablespaces in the
	 * middle of a transaction that's been assigned an xid), but we can't
+	 * assert that because it's also called at the end of a transaction that's
+	 * rolling back, to reset the GUC if it was set inside the transaction.
+	 */
+
+	/* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+	MyUndoLogState.need_to_choose_tablespace = true;
+}
+
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+	char   *rawname;
+	List   *namelist;
+	bool	need_to_unlock;
+	int		length;
+	int		i;
+
+	/* We need a modifiable copy of string. */
+	rawname = pstrdup(undo_tablespaces);
+
+	/* Break string into list of identifiers. */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+		elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+	length = list_length(namelist);
+	if (length == 0 ||
+		(length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+	{
+		/*
+		 * If it's an empty string, then we'll use the default tablespace.  No
+		 * locking is required because it can't be dropped.
+		 */
+		*tablespace = DEFAULTTABLESPACE_OID;
+		need_to_unlock = false;
+	}
+	else
+	{
+		/*
+		 * Choose an OID using our pid, so that if several backends have the
+		 * same multi-tablespace setting they'll spread out.  We could easily
+		 * do better than this if more serious load balancing is judged
+		 * useful.
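+		 *
+		 * For example (illustrative): with undo_tablespaces = 'ts1,ts2', a
+		 * backend with an even pid starts its search at ts1, and one with an
+		 * odd pid at ts2.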
+		 */
+		int		index = MyProcPid % length;
+		int		first_index = index;
+		Oid		oid = InvalidOid;
+
+		/*
+		 * Take the tablespace create/drop lock while we look the name up.
+		 * This prevents the tablespace from being dropped while we're trying
		 * to resolve the name, or while the caller is trying to create an
+		 * undo log in it.  The caller will have to release this lock.
+		 */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		for (;;)
+		{
+			const char *name = list_nth(namelist, index);
+
+			oid = get_tablespace_oid(name, true);
+			if (oid == InvalidOid)
+			{
+				/* Unknown tablespace, try the next one. */
+				index = (index + 1) % length;
+				/*
+				 * But if we've tried them all, it's time to complain.  We'll
+				 * arbitrarily complain about the last one we tried in the
+				 * error message.
+				 */
+				if (index == first_index)
+					ereport(ERROR,
+							(errcode(ERRCODE_UNDEFINED_OBJECT),
+							 errmsg("tablespace \"%s\" does not exist", name),
+							 errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+				continue;
+			}
+			if (oid == GLOBALTABLESPACE_OID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("undo logs cannot be placed in pg_global tablespace")));
+			/* If we got here we succeeded in finding one. */
+			break;
+		}
+
+		Assert(oid != InvalidOid);
+		*tablespace = oid;
+		need_to_unlock = true;
+	}
+
+	/*
	 * If we came here because the user changed undo_tablespaces, then detach
+	 * from any undo logs we happen to be attached to.
+	 */
+	if (force_detach)
+	{
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+		{
+			UndoLogControl *log = MyUndoLogState.logs[i];
+			UndoLogSharedData *shared = MyUndoLogState.shared;
+
+			if (log != NULL)
+			{
+				LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+				log->pid = InvalidPid;
+				log->xid = InvalidTransactionId;
+				LWLockRelease(&log->mutex);
+
+				LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+				log->next_free = shared->free_lists[i];
+				shared->free_lists[i] = log->logno;
+				LWLockRelease(UndoLogLock);
+
+				MyUndoLogState.logs[i] = NULL;
+			}
+		}
+	}
+
+	return need_to_unlock;
+}
+
+bool
+DropUndoLogsInTablespace(Oid tablespace)
+{
+	DIR *dir;
+	char undo_path[MAXPGPATH];
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMe(TablespaceCreateLock));
+	Assert(tablespace != DEFAULTTABLESPACE_OID);
+
+	/* First, try to kick everyone off any undo logs in this tablespace. */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		bool ok;
+		bool return_to_freelist = false;
+
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/* Check if this undo log can be forcibly detached. */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		if (log->meta.discard == log->meta.insert &&
+			(log->xid == InvalidTransactionId ||
+			 !TransactionIdIsInProgress(log->xid)))
+		{
+			log->xid = InvalidTransactionId;
+			if (log->pid != InvalidPid)
+			{
+				log->pid = InvalidPid;
+				return_to_freelist = true;
+			}
+			ok = true;
+		}
+		else
+		{
+			/*
+			 * There is data we need in this undo log.  We can't force it to
+			 * be detached.
+			 */
+			ok = false;
+		}
+		LWLockRelease(&log->mutex);
+
+		/* If we failed, then give up now and report failure. */
+		if (!ok)
+			return false;
+
+		/*
+		 * Put this undo log back on the appropriate free-list.  No one can
+		 * attach to it while we hold TablespaceCreateLock, but if we return
+		 * attach to it while we hold TablespaceCreateLock, but if we return
+		 * failure during a later iteration of this loop, we need the undo
+		 * log to remain usable.  We'll remove all appropriate logs from the
+		 */
+		if (return_to_freelist)
+		{
+			LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+			LWLockRelease(UndoLogLock);
+		}
+	}
+
+	/*
+	 * We detached all backends from undo logs in this tablespace, and no one
+	 * can attach to any non-default-tablespace undo logs while we hold
+	 * TablespaceCreateLock.  We can now drop the undo logs.
+	 */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/*
+		 * Make sure no buffers remain.  When that is done by UndoDiscard(),
+		 * the final page is left in shared_buffers because it may contain
+		 * data, or at least be needed again very soon.  Here we need to drop
+		 * even that page from the buffer pool.
+		 */
+		forget_undo_buffers(log->logno, log->meta.discard, log->meta.discard, true);
+
+		/*
+		 * TODO: For now we drop the undo log, meaning that it will never be
+		 * used again.  That wastes the rest of its address space.  Instead,
+		 * we should put it onto a special list of 'offline' undo logs, ready
+		 * to be reactivated in some other tablespace.  Then we can keep the
+		 * unused portion of its address space.
+		 */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		log->meta.status = UNDO_LOG_STATUS_DISCARDED;
+		LWLockRelease(&log->mutex);
+	}
+
+	/* Unlink all undo segment files in this tablespace. */
+	UndoLogDirectory(tablespace, undo_path);
+
+	dir = AllocateDir(undo_path);
+	if (dir != NULL)
+	{
+		struct dirent *de;
+
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 ||
+				strcmp(de->d_name, "..") == 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			if (unlink(segment_path) < 0)
+				elog(LOG, "could not unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+	}
+
+	/* Remove all dropped undo logs from the free-lists. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		UndoLogControl *log;
+		UndoLogNumber *place;
+
+		place = &shared->free_lists[i];
+		while (*place != InvalidUndoLogNumber)
+		{
+			log = get_undo_log(*place, true);
+			if (!log)
+				elog(ERROR,
+					 "corrupted undo log freelist, unknown log %u", *place);
+			if (log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+				*place = log->next_free;
+			else
+				place = &log->next_free;
+		}
+	}
+	LWLockRelease(UndoLogLock);
+
+	return true;
+}
+
+void
+ResetUndoLogs(UndoPersistence persistence)
+{
+	UndoLogControl *log;
+
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		DIR	   *dir;
+		struct dirent *de;
+		char	undo_path[MAXPGPATH];
+		char	segment_prefix[MAXPGPATH];
+		size_t	segment_prefix_size;
+
+		if (log->meta.persistence != persistence)
+			continue;
+
+		/* Scan the directory for files belonging to this undo log. */
+		snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", log->logno);
+		segment_prefix_size = strlen(segment_prefix);
+		UndoLogDirectory(log->meta.tablespace, undo_path);
+		dir = AllocateDir(undo_path);
+		if (dir == NULL)
+			continue;
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */
+			if (unlink(segment_path) < 0)
+				elog(LOG, "could not unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+
+		/*
+		 * We have no segment files.  Set the pointers to indicate that there
+		 * is no data.  The discard and insert pointers point to the first
+		 * usable byte in the segment we will create when we next try to
+		 * allocate.  This is a bit strange, because it means that they are
+		 * past the end pointer.  That's the same as when new undo logs are
+		 * created.
+		 *
+		 * TODO: Should we rewind to zero instead, so we can reuse that (now)
+		 * unreferenced address space?
+		 */
+		log->meta.insert = log->meta.discard = log->meta.end +
+			UndoLogBlockHeaderSize;
+	}
+}
+
+Datum
+pg_stat_get_undo_logs(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_UNDO_LOGS_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char *tablespace_name = NULL;
+	Oid last_tablespace = InvalidOid;
+	int			i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Scan all undo logs to build the results. */
+	for (i = 0; i < shared->array_size; ++i)
+	{
+		UndoLogControl *log = &shared->logs[i];
+		char buffer[17];
+		Datum values[PG_STAT_GET_UNDO_LOGS_COLS];
+		bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false };
+		Oid tablespace;
+
+		if (log == NULL)
+			continue;
+
+		/*
+		 * This won't be a consistent result overall, but the values for each
+		 * log will be consistent because we'll take the per-log lock while
+		 * copying them.
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+
+		/* Skip unused slots and entirely discarded undo logs. */
+		if (log->logno == InvalidUndoLogNumber ||
+			log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+		{
+			LWLockRelease(&log->mutex);
+			continue;
+		}
+
+		values[0] = ObjectIdGetDatum((Oid) log->logno);
+		values[1] = CStringGetTextDatum(
+			log->meta.persistence == UNDO_PERMANENT ? "permanent" :
+			log->meta.persistence == UNDO_UNLOGGED ? "unlogged" :
+			log->meta.persistence == UNDO_TEMP ? "temporary" : "<unknown>");
+		tablespace = log->meta.tablespace;
+
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.discard));
+		values[3] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.insert));
+		values[4] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.end));
+		values[5] = CStringGetTextDatum(buffer);
+		if (log->xid == InvalidTransactionId)
+			nulls[6] = true;
+		else
+			values[6] = TransactionIdGetDatum(log->xid);
+		if (log->pid == InvalidPid)
+			nulls[7] = true;
+		else
+			values[7] = Int32GetDatum((int32) log->pid);
+		if (log->meta.prevlogno == InvalidUndoLogNumber)
+			nulls[8] = true;
+		else
+			values[8] = ObjectIdGetDatum((Oid) log->meta.prevlogno);
+		switch (log->meta.status)
+		{
+		case UNDO_LOG_STATUS_ACTIVE:
+			values[9] = CStringGetTextDatum("ACTIVE"); break;
+		case UNDO_LOG_STATUS_FULL:
+			values[9] = CStringGetTextDatum("FULL"); break;
+		default:
+			nulls[9] = true;
+		}
+		LWLockRelease(&log->mutex);
+
+		/*
+		 * Deal with potentially slow tablespace name lookup without the lock.
+		 * Avoid making multiple calls to that expensive function for the
+		 * common case of repeating tablespace.
+		 */
+		if (tablespace != last_tablespace)
+		{
+			if (tablespace_name)
+				pfree(tablespace_name);
+			tablespace_name = get_tablespace_name(tablespace);
+			last_tablespace = tablespace;
+		}
+		if (tablespace_name)
+		{
+			values[2] = CStringGetTextDatum(tablespace_name);
+			nulls[2] = false;
+		}
+		else
+			nulls[2] = true;
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	if (tablespace_name)
+		pfree(tablespace_name);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * replay the creation of a new undo log
+ */
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+	xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	/* Create meta-data space in shared memory. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	/* TODO: assert that it doesn't exist already? */
+	log = allocate_undo_log();
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->logno = xlrec->logno;
+	log->meta.logno = xlrec->logno;
+	log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+	log->meta.persistence = xlrec->persistence;
+	log->meta.tablespace = xlrec->tablespace;
+	log->meta.insert = UndoLogBlockHeaderSize;
+	log->meta.discard = UndoLogBlockHeaderSize;
+	shared->next_logno = Max(xlrec->logno + 1, shared->next_logno);
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * replay the addition of a new segment to an undo log
+ */
+static void
+undolog_xlog_extend(XLogReaderState *record)
+{
+	xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record);
+
+	/* Extend exactly as we would during DO phase. */
+	extend_undo_log(xlrec->logno, xlrec->end);
+}
+
+/*
+ * replay the association of an xid with a specific undo log
+ */
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+	xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+	/* Restore current dbid */
+	MyUndoLogState.dbid = xlrec->dbid;
+
+	/*
+	 * Whatever follows is the first record for this transaction.  Zheap will
+	 * use this to add UREC_INFO_TRANSACTION.
+	 */
+	log = get_undo_log(xlrec->logno, false);
+	/* TODO */
+	log->meta.is_first_rec = true;
+	log->xid = xlrec->xid;
+}
+
+/*
+ * Drop all buffers for the given undo log, from old_discard up to
+ * new_discard.  If drop_tail is true, also drop the buffer that holds
+ * new_discard; this is used when discarding undo logs completely, for example
+ * via DROP TABLESPACE.  If it is false, then the final buffer is not dropped
+ * because it may contain data.
+ */
+static void
+forget_undo_buffers(int logno, UndoLogOffset old_discard,
+					UndoLogOffset new_discard, bool drop_tail)
+{
+	BlockNumber old_blockno;
+	BlockNumber new_blockno;
+	RelFileNode	rnode;
+
+	UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard));
+	old_blockno = old_discard / BLCKSZ;
+	new_blockno = new_discard / BLCKSZ;
+	if (drop_tail)
+		++new_blockno;
+	while (old_blockno < new_blockno)
+		ForgetBuffer(rnode, UndoLogForkNum, old_blockno++);
+}
+
+/*
+ * replay an undo segment discard record
+ */
+static void
+undolog_xlog_discard(XLogReaderState *record)
+{
+	xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	UndoLogOffset old_segment_begin;
+	UndoLogOffset new_segment_begin;
+	RelFileNode rnode = {0};
+	char	dir[MAXPGPATH];
+
+	log = get_undo_log(xlrec->logno, false);
+	if (log == NULL)
+		elog(ERROR, "unknown undo log %d", xlrec->logno);
+
+	/*
+	 * We're about to discard undo logs.  In Hot Standby mode, ensure that
+	 * there are no queries running that need to fetch tuples from the
+	 * discarded undo.
+	 *
+	 * XXX we are passing an empty rnode to the conflict function so that it
+	 * checks for conflicts in every backend, regardless of which database
+	 * each backend is connected to.
+	 */
+	if (InHotStandby && TransactionIdIsValid(xlrec->latestxid))
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode);
+
+	/*
+	 * See if we need to unlink or rename any files, but don't consider it an
+	 * error if we find that files are missing.  Since UndoLogDiscard()
+	 * performs filesystem operations before WAL logging or updating shmem
+	 * which could be checkpointed, a crash could have left files already
+	 * deleted, but we could replay WAL that expects the files to be there.
+	 */
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno == xlrec->logno);
+	discard = log->meta.discard;
+	end = log->meta.end;
+	LWLockRelease(&log->mutex);
+
+	/* Drop buffers before we remove/recycle any files. */
+	forget_undo_buffers(xlrec->logno, discard, xlrec->discard,
+						xlrec->entirely_discarded);
+
+	/* Rewind to the start of the segment. */
+	old_segment_begin = discard - discard % UndoLogSegmentSize;
+	new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize;
+
+	/* Unlink or rename segments that are no longer in range. */
+	while (old_segment_begin < new_segment_begin)
+	{
+		char	discard_path[MAXPGPATH];
+
+		/*
+		 * Before removing the file, make sure that undofile_sync knows that
+		 * it might be missing.
+		 */
+		undofile_forgetsync(log->logno,
+							log->meta.tablespace,
+							old_segment_begin / UndoLogSegmentSize);
+
+		UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize,
+						   log->meta.tablespace, discard_path);
+
+		/* Can we recycle the oldest segment? */
+		if (end < xlrec->end)
+		{
+			char	recycle_path[MAXPGPATH];
+
+			UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize,
+							   log->meta.tablespace, recycle_path);
+			if (rename(discard_path, recycle_path) == 0)
+			{
+				elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+				end += UndoLogSegmentSize;
+			}
+			else
+			{
+				elog(LOG, "could not rename \"%s\" to \"%s\": %m",
+					 discard_path, recycle_path);
+			}
+		}
+		else
+		{
+			if (unlink(discard_path) == 0)
+				elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+			else
+				elog(LOG, "could not unlink \"%s\": %m", discard_path);
+		}
+		old_segment_begin += UndoLogSegmentSize;
+	}
+
+	/* Create any further new segments that are needed the slow way. */
+	while (end < xlrec->end)
+	{
+		allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	/* Flush the directory entries. */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/* Update shmem. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = xlrec->discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (xlrec->entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * replay the rewind of an undo log
+ */
+static void
+undolog_xlog_rewind(XLogReaderState *record)
+{
+	xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	log = get_undo_log(xlrec->logno, false);
+	log->meta.insert = xlrec->insert;
+	log->meta.prevlen = xlrec->prevlen;
+}
+
+void
+undolog_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			undolog_xlog_create(record);
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			undolog_xlog_extend(record);
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			undolog_xlog_attach(record);
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			undolog_xlog_discard(record);
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			undolog_xlog_rewind(record);
+			break;
+		default:
+			elog(PANIC, "undolog_redo: unknown op code %u", info);
+	}
+}
+
+/*
+ * For assertions only.
+ */
+bool
+AmAttachedToUndoLog(UndoLogControl *log)
+{
+	/*
+	 * In general, we can't access log's members without locking.  But this
+	 * function is intended only for asserting that you are attached, and
+	 * while you're attached the slot can't be recycled, so don't bother
+	 * locking.
+	 */
+	return MyUndoLogState.logs[log->meta.persistence] == log;
+}
+
+/*
+ * For testing use only.  This function is only used by the test_undo module.
+ */
+void
+UndoLogDetachFull(void)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+		if (MyUndoLogState.logs[i])
+			detach_current_undo_log(i, true);
+}
+
+/*
+ * Fetch database id from the undo log state
+ */
+Oid
+UndoLogStateGetDatabaseId()
+{
+	Assert(InRecovery);
+	return MyUndoLogState.dbid;
+}
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 715995d..9fc79b6 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -939,6 +939,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio
     ON pg_subscription TO public;
 
 
+CREATE VIEW pg_stat_undo_logs AS
+    SELECT *
+    FROM pg_stat_get_undo_logs();
+
 --
 -- We have a few function definitions in here, too.
 -- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4a714f6..281c1e9 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -54,6 +54,7 @@
 #include "access/reloptions.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -488,6 +489,20 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
 
 	/*
+	 * Drop the undo logs in this tablespace.  This will fail (without
+	 * dropping anything) if there are undo logs that we can't afford to drop
+	 * because they contain non-discarded data or a transaction is in
+	 * progress.  Since we hold TablespaceCreateLock, no other session will be
+	 * able to attach to an undo log in this tablespace (or any tablespace
+	 * except default) concurrently.
+	 */
+	if (!DropUndoLogsInTablespace(tablespaceoid))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
+						tablespacename)));
+
+	/*
 	 * Try to remove the physical infrastructure.
 	 */
 	if (!destroy_tablespace_directories(tablespaceoid, false))
@@ -1487,6 +1502,14 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		/* This shouldn't be able to fail in recovery. */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		if (!DropUndoLogsInTablespace(xlrec->ts_id))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("tablespace cannot be dropped because it contains non-empty undo logs")));
+		LWLockRelease(TablespaceCreateLock);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e3b0565..1a7a381 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -154,6 +154,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_UNDOLOG_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a58..4725cbe 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, UndoLogShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	UndoLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81..b6c0b00 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,8 @@ RegisterLWLockTranches(void)
 	LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+	LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log");
+	LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard");
 
 	/* Register named tranches. */
 	for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..554af46 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+UndoLogLock							46
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b636b1e..fcc7a86 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -556,6 +556,7 @@ BaseInit(void)
 	InitFileAccess();
 	smgrinit();
 	InitBufferPoolAccess();
+	UndoLogInit();
 }
 
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 03594e7..8b0ade6 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -119,6 +119,7 @@ extern int	CommitDelay;
 extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
+extern char *undo_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
 
@@ -3534,6 +3535,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Sets the tablespace(s) to use for undo logs."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE
+		},
+		&undo_tablespaces,
+		"",
+		check_undo_tablespaces, assign_undo_tablespaces, NULL
+	},
+
+	{
 		{"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER,
 			gettext_noop("Sets the path for dynamically loadable modules."),
 			gettext_noop("If a dynamically loadable module needs to be opened and "
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 211a963..ea02210 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -209,11 +209,13 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_undo",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
+	"base/undo",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..938150d 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
 #include "access/nbtxlog.h"
 #include "access/rmgr.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 0bbe9879..9c6fca4 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 0000000..10bd502
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,405 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ *
+ * PostgreSQL undo log manager.  This module is responsible for lifecycle
+ * management of undo logs and backing files, associating undo logs with
+ * backends, allocating and managing space within undo logs.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/xlogreader.h"
+#include "catalog/pg_class.h"
+#include "common/relpath.h"
+#include "storage/bufpage.h"
+
+#ifndef FRONTEND
+#include "storage/lwlock.h"
+#endif
+
+/* The type used to identify an undo log and position within it. */
+typedef uint64 UndoRecPtr;
+
+/* The type used for undo record lengths. */
+typedef uint16 UndoRecordSize;
+
+/* Undo log statuses. */
+typedef enum
+{
+	UNDO_LOG_STATUS_UNUSED = 0,
+	UNDO_LOG_STATUS_ACTIVE,
+	UNDO_LOG_STATUS_FULL,
+	UNDO_LOG_STATUS_DISCARDED
+} UndoLogStatus;
+
+/*
+ * Undo log persistence levels.  These have a one-to-one correspondence with
+ * relpersistence values, but are small integers so that we can use them as an
+ * index into the "logs" and "lognos" arrays.
+ */
+typedef enum
+{
+	UNDO_PERMANENT = 0,
+	UNDO_UNLOGGED = 1,
+	UNDO_TEMP = 2
+} UndoPersistence;
+
+#define UndoPersistenceLevels 3
+
+/*
+ * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence
+ * enumerator.
+ */
+#define UndoPersistenceForRelPersistence(rp)						\
+	((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT :			\
+	 (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP)
+
+/*
+ * Convert from UndoPersistence to a relpersistence value.
+ */
+#define RelPersistenceForUndoPersistence(up)				\
+	((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT :	\
+	 (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED :		\
+	 RELPERSISTENCE_TEMP)
+
+/*
+ * Get the appropriate UndoPersistence value from a Relation.
+ */
+#define UndoPersistenceForRelation(rel)									\
+	(UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence))
+
+/* Type for offsets within undo logs */
+typedef uint64 UndoLogOffset;
+
+/* printf-family format string for UndoRecPtr. */
+#define UndoRecPtrFormat "%016" INT64_MODIFIER "X"
+
+/* printf-family format string for UndoLogOffset. */
+#define UndoLogOffsetFormat UINT64_FORMAT
+
+/* Number of blocks of BLCKSZ in an undo log segment file.  128 = 1MB. */
+#define UNDOSEG_SIZE 128
+
+/* Size of an undo log segment file in bytes. */
+#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE)
+
+/* The width of an undo log number in bits.  24 allows for 16.7m logs. */
+#define UndoLogNumberBits 24
+
+/* The maximum valid undo log number. */
+#define MaxUndoLogNumber ((1 << UndoLogNumberBits) - 1)
+
+/* The width of an undo log offset in bits.  40 allows for 1TB per log. */
+#define UndoLogOffsetBits (64 - UndoLogNumberBits)
+
+/* Special value for undo record pointer which indicates that it is invalid. */
+#define	InvalidUndoRecPtr	((UndoRecPtr) 0)
+
+/* End-of-list value when building linked lists of undo logs. */
+#define InvalidUndoLogNumber -1
+
+/*
+ * Special undo record pointer value used in the transaction header to
+ * indicate that we don't yet know the start point of the next transaction;
+ * it will be updated with a valid value later.
+ */
+#define SpecialUndoRecPtr	((UndoRecPtr) 0xFFFFFFFFFFFFFFFF)
+
+/*
+ * The maximum amount of data that can be stored in an undo log.  Can be set
+ * artificially low to test full log behavior.
+ */
+#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits)
+
+/* Type for numbering undo logs. */
+typedef int UndoLogNumber;
+
+/* Extract the undo log number from an UndoRecPtr. */
+#define UndoRecPtrGetLogNo(urp)					\
+	((urp) >> UndoLogOffsetBits)
+
+/* Extract the offset from an UndoRecPtr. */
+#define UndoRecPtrGetOffset(urp)				\
+	((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1))
+
+/* Make an UndoRecPtr from a log number and offset. */
+#define MakeUndoRecPtr(logno, offset)			\
+	(((uint64) (logno) << UndoLogOffsetBits) | (offset))
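+
+/*
+ * For example, with UndoLogNumberBits = 24 and UndoLogOffsetBits = 40,
+ * MakeUndoRecPtr(3, 0x1000) is 0x0000030000001000; UndoRecPtrGetLogNo()
+ * then recovers 3 and UndoRecPtrGetOffset() recovers 0x1000.
+ */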
+
+/* The number of unusable bytes in the header of each block. */
+#define UndoLogBlockHeaderSize SizeOfPageHeaderData
+
+/* The number of usable bytes we can store per block. */
+#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize)
+
+/* The pseudo-database OID used for undo logs. */
+#define UndoLogDatabaseOid 9
+
+/* Length of undo checkpoint filename */
+#define UNDO_CHECKPOINT_FILENAME_LENGTH	16
+
+/*
+ * UndoRecPtrIsValid
+ *		True iff undoRecPtr is valid.
+ */
+#define UndoRecPtrIsValid(undoRecPtr) \
+	((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr))
+
+/* Extract the relnode for an undo log. */
+#define UndoRecPtrGetRelNode(urp)				\
+	UndoRecPtrGetLogNo(urp)
+
+/* The only valid fork number for undo log buffers. */
+#define UndoLogForkNum MAIN_FORKNUM
+
+/* Compute the block number that holds a given UndoRecPtr. */
+#define UndoRecPtrGetBlockNum(urp)				\
+	(UndoRecPtrGetOffset(urp) / BLCKSZ)
+
+/* Compute the offset of a given UndoRecPtr in the page that holds it. */
+#define UndoRecPtrGetPageOffset(urp)			\
+	(UndoRecPtrGetOffset(urp) % BLCKSZ)
+
+/* Compare two undo checkpoint files to find the oldest file. */
+#define UndoCheckPointFilenamePrecedes(file1, file2)	\
+	(strcmp(file1, file2) < 0)
+
+/* What is the offset of the i'th non-header byte? */
+#define UndoLogOffsetFromUsableByteNo(i)								\
+	(((i) / UndoLogUsableBytesPerPage) * BLCKSZ +						\
+	 UndoLogBlockHeaderSize +											\
+	 ((i) % UndoLogUsableBytesPerPage))
+
+/* How many non-header bytes are there before a given offset? */
+#define UndoLogOffsetToUsableByteNo(offset)				\
+	(((offset) % BLCKSZ - UndoLogBlockHeaderSize) +		\
+	 ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage)
+
+/* Add 'n' usable bytes to offset stepping over headers to find new offset. */
+#define UndoLogOffsetPlusUsableBytes(offset, n)							\
+	UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n))
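+
+/*
+ * For example, with the default BLCKSZ of 8192 and a 24-byte page header,
+ * UndoLogUsableBytesPerPage is 8168: usable byte number 8168 maps to offset
+ * 8216 (the first usable byte of the second block), and
+ * UndoLogOffsetPlusUsableBytes(8216, 8168) yields 16408.
+ */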
+
+/* Populate a RelFileNode from an UndoRecPtr. */
+#define UndoRecPtrAssignRelFileNode(rfn, urp)			\
+	do													\
+	{													\
+		(rfn).spcNode = UndoRecPtrGetTablespace(urp);	\
+		(rfn).dbNode = UndoLogDatabaseOid;				\
+		(rfn).relNode = UndoRecPtrGetRelNode(urp);		\
+	} while (false)
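+
+/*
+ * For example, an UndoRecPtr in undo log 3, stored in the default tablespace,
+ * populates RelFileNode as { DEFAULTTABLESPACE_OID, UndoLogDatabaseOid, 3 }.
+ */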
+
+/*
+ * Control metadata for an active undo log.  Lives in shared memory inside an
+ * UndoLogControl object, but also written to disk during checkpoints.
+ */
+typedef struct UndoLogMetaData
+{
+	UndoLogNumber logno;
+	UndoLogStatus status;
+	Oid		tablespace;
+	UndoPersistence persistence;	/* permanent, unlogged, temp? */
+	UndoLogOffset insert;			/* next insertion point (head) */
+	UndoLogOffset end;				/* one past end of highest segment */
+	UndoLogOffset discard;			/* oldest data needed (tail) */
+	UndoLogOffset last_xact_start;	/* last transaction's start undo offset */
+
+	/*
+	 * If the same transaction is split over two undo logs, this stores the
+	 * previous log number; see the file header comments of undorecord.c for
+	 * its usage.
+	 *
+	 * Fixme: See if we can find another way to handle this instead of keeping
+	 * the previous log number.
+	 */
+	UndoLogNumber prevlogno;		/* Previous undo log number */
+	bool	is_first_rec;
+
+	/*
+	 * Length of the last undo record.  We save this in the undo meta-data and
+	 * in the WAL so that it is preserved across restarts and the first undo
+	 * record written after a restart can read it.  It is used to step back to
+	 * the previous record of a transaction during rollback: if a transaction
+	 * wrote some records before a checkpoint and more after it, we could not
+	 * roll back properly without the pre-checkpoint record's prevlen.  The
+	 * undo worker also fetches this value when rolling back the last
+	 * transaction in an undo log, to locate that transaction's last undo
+	 * record.
+	 */
+	uint16	prevlen;
+} UndoLogMetaData;
+
+#ifndef FRONTEND
+
+/*
+ * The in-memory control object for an undo log.  We have a fixed-sized array
+ * of these.
+ */
+typedef struct UndoLogControl
+{
+	/*
+	 * Protected by UndoLogLock and 'mutex'.  Both must be held to steal this
+	 * slot for another undolog.  Either may be held to prevent that from
+	 * happening.
+	 */
+	UndoLogNumber logno;			/* InvalidUndoLogNumber for unused slots */
+
+	/* Protected by UndoLogLock. */
+	UndoLogNumber next_free;		/* link for active unattached undo logs */
+
+	/* Protected by 'mutex'. */
+	LWLock	mutex;
+	UndoLogMetaData meta;			/* current meta-data */
+	XLogRecPtr      lsn;
+	bool	need_attach_wal_record;	/* need to write an attach WAL record? */
+	pid_t		pid;				/* InvalidPid for unattached */
+	TransactionId xid;
+
+	/* Protected by 'discard_lock'.  State used by undo workers. */
+	LWLock		discard_lock;		/* prevents discarding while reading */
+	TransactionId	oldest_xid;		/* cache of oldest transaction's xid */
+	uint32		oldest_xidepoch;
+	UndoRecPtr	oldest_data;
+
+} UndoLogControl;
+
+extern UndoLogControl *UndoLogGet(UndoLogNumber logno, bool missing_ok);
+extern UndoLogControl *UndoLogNext(UndoLogControl *log);
+extern bool AmAttachedToUndoLog(UndoLogControl *log);
+extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full);
+
+/*
+ * Each backend maintains a small hash table mapping undo log numbers to
+ * UndoLogControl objects in shared memory.
+ *
+ * We also cache the tablespace here, since we need fast access to that when
+ * resolving UndoRecPtr to a buffer tag.  We could also reach that via
+ * control->meta.tablespace, but that can't be accessed without locking (since
+ * the UndoLogControl object might be recycled).  Since the tablespace for a
+ * given undo log is constant for the whole life of the undo log, there is no
+ * invalidation problem to worry about.
+ */
+typedef struct UndoLogTableEntry
+{
+	UndoLogNumber	number;
+	UndoLogControl *control;
+	Oid				tablespace;
+	char			status;
+} UndoLogTableEntry;
+
+/*
+ * Instantiate fast inline hash table access functions.  We use an identity
+ * hash function for speed, since we already have integers and don't expect
+ * many collisions.
+ */
+#define SH_PREFIX undologtable
+#define SH_ELEMENT_TYPE UndoLogTableEntry
+#define SH_KEY_TYPE UndoLogNumber
+#define SH_KEY number
+#define SH_HASH_KEY(tb, key) (key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+extern PGDLLIMPORT undologtable_hash *undologtable_cache;
+
+/*
+ * Find the OID of the tablespace that holds a given UndoRecPtr.  This is
+ * included in the header so it can be inlined by UndoRecPtrAssignRelFileNode.
+ */
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+	UndoLogNumber		logno = UndoRecPtrGetLogNo(urp);
+	UndoLogTableEntry  *entry;
+
+	/*
+	 * Fast path, for undo logs we've seen before.  This is safe because
+	 * tablespaces are constant for the lifetime of an undo log number.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		return entry->tablespace;
+
+	/*
+	 * Slow path: force cache entry to be created.  Raises an error if the
+	 * undo log has been entirely discarded, or hasn't been created yet.  That
+	 * is appropriate here, because this interface is designed for accessing
+	 * undo pages via bufmgr, and we should never be trying to access undo
+	 * pages that have been discarded.
+	 */
+	UndoLogGet(logno, false);
+
+	/*
+	 * We use the value from the newly created cache entry, because it's
+	 * cheaper than acquiring log->mutex and reading log->meta.tablespace.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	return entry->tablespace;
+}
+#endif
+
+/* Space management. */
+extern UndoRecPtr UndoLogAllocate(size_t size,
+								  UndoPersistence level);
+extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid,
+											size_t size,
+											UndoPersistence persistence);
+extern void UndoLogAdvance(UndoRecPtr insertion_point,
+						   size_t size,
+						   UndoPersistence persistence);
+extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid);
+extern bool UndoLogIsDiscarded(UndoRecPtr point);
+
+/* Initialization interfaces. */
+extern void StartupUndoLogs(XLogRecPtr checkPointRedo);
+extern void UndoLogShmemInit(void);
+extern Size UndoLogShmemSize(void);
+extern void UndoLogInit(void);
+extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace,
+							   char *path);
+extern void ResetUndoLogs(UndoPersistence persistence);
+
+/* Interface used by tablespace.c. */
+extern bool DropUndoLogsInTablespace(Oid tablespace);
+
+/* GUC interfaces. */
+extern void assign_undo_tablespaces(const char *newval, void *extra);
+
+/* Checkpointing interfaces. */
+extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo,
+							   XLogRecPtr priorCheckPointRedo);
+
+extern void UndoLogSetLastXactStartPoint(UndoRecPtr point);
+extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno);
+extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno,
+										  TransactionId xid);
+extern UndoRecPtr UndoLogGetLastRecordPtr(UndoLogNumber,
+										  TransactionId xid);
+extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen);
+extern bool IsTransactionFirstRec(TransactionId xid);
+extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen);
+extern uint16 UndoLogGetPrevLen(UndoLogNumber logno);
+extern void UndoLogSetLSN(XLogRecPtr lsn);
+extern void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);
+/* Discard the undo logs for temp tables */
+extern void TempUndoDiscard(UndoLogNumber);
+extern Oid UndoLogStateGetDatabaseId(void);
+
+/* Test-only interfaces. */
+extern void UndoLogDetachFull(void);
+
+#endif
diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h
new file mode 100644
index 0000000..fe88ac5
--- /dev/null
+++ b/src/include/access/undolog_xlog.h
@@ -0,0 +1,72 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog_xlog.h
+ *	  undo log access XLOG definitions.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_XLOG_H
+#define UNDOLOG_XLOG_H
+
+#include "access/undolog.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* XLOG records */
+#define XLOG_UNDOLOG_CREATE		0x00
+#define XLOG_UNDOLOG_EXTEND		0x10
+#define XLOG_UNDOLOG_ATTACH		0x20
+#define XLOG_UNDOLOG_DISCARD	0x30
+#define XLOG_UNDOLOG_REWIND		0x40
+#define XLOG_UNDOLOG_META		0x50
+
+/* Create a new undo log. */
+typedef struct xl_undolog_create
+{
+	UndoLogNumber logno;
+	Oid		tablespace;
+	UndoPersistence persistence;
+} xl_undolog_create;
+
+/* Extend an undo log by adding a new segment. */
+typedef struct xl_undolog_extend
+{
+	UndoLogNumber logno;
+	UndoLogOffset end;
+} xl_undolog_extend;
+
+/* Record the undo log number used for a transaction. */
+typedef struct xl_undolog_attach
+{
+	TransactionId xid;
+	UndoLogNumber logno;
+	Oid				dbid;
+} xl_undolog_attach;
+
+/* Discard space, and possibly destroy or recycle undo log segments. */
+typedef struct xl_undolog_discard
+{
+	UndoLogNumber logno;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	TransactionId latestxid;	/* latest xid whose undolog are discarded. */
+	bool		  entirely_discarded;
+} xl_undolog_discard;
+
+/* Rewind insert location of the undo log. */
+typedef struct xl_undolog_rewind
+{
+	UndoLogNumber logno;
+	UndoLogOffset insert;
+	uint16		  prevlen;
+} xl_undolog_rewind;
+
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 034a41e..bc8da54 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10048,4 +10048,11 @@
   proargnames => '{rootrelid,relid,parentrelid,isleaf,level}',
   prosrc => 'pg_partition_tree' },
 
+# undo logs
+{ oid => '5032', descr => 'list undo logs',
+  proname => 'pg_stat_get_undo_logs', procost => '1', prorows => '10', proretset => 't',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,text,text,text,text,text,xid,int4,oid,text}', proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{logno,persistence,tablespace,discard,insert,end,xid,pid,prev_logno,status}', prosrc => 'pg_stat_get_undo_logs' },
+
 ]
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73..4305af6 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,8 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SHARED_TUPLESTORE,
 	LWTRANCHE_TBM,
 	LWTRANCHE_PARALLEL_APPEND,
+	LWTRANCHE_UNDOLOG,
+	LWTRANCHE_UNDODISCARD,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 64457c7..8b30828 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -426,6 +426,8 @@ extern void GUC_check_errcode(int sqlerrcode);
 extern bool check_default_tablespace(char **newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra, GucSource source);
 extern void assign_temp_tablespaces(const char *newval, void *extra);
+extern bool check_undo_tablespaces(char **newval, void **extra, GucSource source);
+extern void assign_undo_tablespaces(const char *newval, void *extra);
 
 /* in catalog/namespace.c */
 extern bool check_search_path(char **newval, void **extra, GucSource source);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 735dd37..f3de192 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1918,6 +1918,17 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_undo_logs| SELECT pg_stat_get_undo_logs.logno,
+    pg_stat_get_undo_logs.persistence,
+    pg_stat_get_undo_logs.tablespace,
+    pg_stat_get_undo_logs.discard,
+    pg_stat_get_undo_logs.insert,
+    pg_stat_get_undo_logs."end",
+    pg_stat_get_undo_logs.xid,
+    pg_stat_get_undo_logs.pid,
+    pg_stat_get_undo_logs.prev_logno,
+    pg_stat_get_undo_logs.status
+   FROM pg_stat_get_undo_logs() pg_stat_get_undo_logs(logno, persistence, tablespace, discard, insert, "end", xid, pid, prev_logno, status);
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
-- 
1.8.3.1

0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v3.patchapplication/x-patch; name=0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v3.patchDownload
From 19cdf05f7dbfdec57605f273f0c78c999f3f0d0a Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 3 Dec 2018 10:12:34 +0530
Subject: [PATCH 2/2] Provide access to undo log data via the buffer manager.

In ancient Berkeley POSTGRES, smgr.c allowed for different storage engines, of
which only md.c survives.  Revive this mechanism to provide access to undo log
data through the existing buffer manager.

Undo logs exist in a pseudo-database whose OID is used to dispatch IO requests
to undofile.c instead of md.c.
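
Roughly, the dispatch idea is this sketch (the names SmgrImpl, choose_smgr_impl
and the constants are illustrative only, not what the patch actually uses; the
real dispatch lives in smgr.c):

    /* Illustrative sketch: pick a storage manager implementation. */
    typedef enum { SMGR_IMPL_MD, SMGR_IMPL_UNDOFILE } SmgrImpl;

    static SmgrImpl
    choose_smgr_impl(RelFileNode rnode)
    {
        /* Undo log data lives in a pseudo-database; route it to undofile.c. */
        if (rnode.dbNode == UndoLogDatabaseOid)
            return SMGR_IMPL_UNDOFILE;
        return SMGR_IMPL_MD;
    }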

Note: a separate proposal generalizes the fsync request machinery, see
https://commitfest.postgresql.org/20/1829/.  This patch has some stand-in
fsync machinery, but will be rebased on that other one depending on progress.
It seems better to avoid tangling up too many concurrent proposals, so for
now this patch has its own fsync queue, duplicating some code from md.c.

Author: Thomas Munro, though ForgetBuffer() was contributed by Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  10 +-
 src/backend/postmaster/checkpointer.c  |   2 +-
 src/backend/postmaster/pgstat.c        |  24 +-
 src/backend/storage/buffer/bufmgr.c    |  82 ++++-
 src/backend/storage/smgr/Makefile      |   2 +-
 src/backend/storage/smgr/md.c          |  15 +-
 src/backend/storage/smgr/smgr.c        |  49 ++-
 src/backend/storage/smgr/undofile.c    | 546 +++++++++++++++++++++++++++++++++
 src/include/pgstat.h                   |  16 +-
 src/include/storage/bufmgr.h           |  14 +-
 src/include/storage/smgr.h             |  35 ++-
 src/include/storage/undofile.h         |  50 +++
 12 files changed, 810 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/storage/smgr/undofile.c
 create mode 100644 src/include/storage/undofile.h

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc92..8fed7b1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -346,7 +346,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * Make sure that if the block is marked with WILL_INIT, the caller is
 	 * going to initialize it. And vice versa.
 	 */
-	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+	zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
@@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b9c118e..b2505c8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1314,7 +1314,7 @@ AbsorbFsyncRequests(void)
 	LWLockRelease(CheckpointerCommLock);
 
 	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+		smgrrequestsync(request->rnode, request->forknum, request->segno);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8676088..9d717d9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3515,7 +3515,7 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
-			/* no default case, so that compiler will warn */
+		/* no default case, so that compiler will warn */
 	}
 
 	return event_name;
@@ -3897,6 +3897,28 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_READ:
+			event_name = "UndoCheckpointRead";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_WRITE:
+			event_name = "UndoCheckpointWrite";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_SYNC:
+			event_name = "UndoCheckpointSync";
+			break;
+		case WAIT_EVENT_UNDO_FILE_READ:
+			event_name = "UndoFileRead";
+			break;
+		case WAIT_EVENT_UNDO_FILE_WRITE:
+			event_name = "UndoFileWrite";
+			break;
+		case WAIT_EVENT_UNDO_FILE_FLUSH:
+			event_name = "UndoFileFlush";
+			break;
+		case WAIT_EVENT_UNDO_FILE_SYNC:
+			event_name = "UndoFileSync";
+			break;
+
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9817770..bf2408a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
 static inline int32 GetPrivateRefCount(Buffer buffer);
 static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void InvalidateBuffer(BufferDesc *buf);
 
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * valid, the page is zeroed instead of throwing an error. This is intended
  * for non-critical data, where the caller is prepared to repair errors.
  *
- * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
+ * In RBM_ZERO mode, if the page isn't in buffer cache already, it's
  * filled with zeros instead of reading it from disk.  Useful when the caller
  * is going to fill the page from scratch, since this saves I/O and avoids
  * unnecessary failure if the page-on-disk has corrupt page headers.
+ *
+ * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked.
  * The page is returned locked to ensure that the caller has a chance to
  * initialize the page before it's made visible to others.
  * Caution: do not use this mode to read a page that is beyond the relation's
@@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy,
+						  char relpersistence)
 {
 	bool		hit;
 
-	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
-
-	Assert(InRecovery);
+	SMgrRelation smgr = smgropen(rnode,
+								 relpersistence == RELPERSISTENCE_TEMP
+								 ? MyBackendId : InvalidBackendId);
 
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -883,7 +882,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Read in the page, unless the caller intends to overwrite it and
 		 * just wants us to allocate a buffer.
 		 */
-		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+		if (mode == RBM_ZERO ||
+			mode == RBM_ZERO_AND_LOCK ||
+			mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
@@ -1338,6 +1339,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 }
 
 /*
+ * ForgetBuffer -- drop a buffer from shared buffers
+ *
+ * If the buffer isn't present in shared buffers, nothing happens.  If it is
+ * present, it is discarded without making any attempt to write it back out to
+ * the operating system.  The caller must therefore somehow be sure that the
+ * data won't be needed for anything now or in the future.  It assumes that
+ * there is no concurrent access to the block, except that it might be being
+ * concurrently written.
+ */
+void
+ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+{
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+	BufferTag	tag;			/* identity of target block */
+	uint32		hash;			/* hash value for tag */
+	LWLock	   *partitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	hash = BufTableHashCode(&tag);
+	partitionLock = BufMappingPartitionLock(hash);
+
+	/* see if the block is in the buffer pool */
+	LWLockAcquire(partitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&tag, hash);
+	LWLockRelease(partitionLock);
+
+	/* didn't find it, so nothing to do */
+	if (buf_id < 0)
+		return;
+
+	/* take the buffer header lock */
+	bufHdr = GetBufferDescriptor(buf_id);
+	buf_state = LockBufHdr(bufHdr);
+
+	/*
+	 * The buffer might have been evicted after we released the partition lock
+	 * and before we acquired the buffer header lock.  If so, the buffer we've
+	 * locked might contain some other data which we shouldn't touch.  If the
+	 * buffer hasn't been recycled, we proceed to invalidate it.
+	 */
+	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		bufHdr->tag.blockNum == blockNum &&
+		bufHdr->tag.forkNum == forkNum)
+		InvalidateBuffer(bufHdr);		/* releases spinlock */
+	else
+		UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
  * InvalidateBuffer -- mark a shared buffer invalid and return it to the
  * freelist.
  *
@@ -1412,7 +1468,7 @@ retry:
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
 		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
-			elog(ERROR, "buffer is pinned in InvalidateBuffer");
+			elog(PANIC, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
 	}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0..b657eb2 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrtype.o undofile.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4c6a505..4c489a2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -45,7 +45,7 @@
 #define UNLINKS_PER_ABSORB		10
 
 /*
- * Special values for the segno arg to RememberFsyncRequest.
+ * Special values for the segno arg to mdrequestsync.
  *
  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
@@ -1420,7 +1420,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
 	}
 	else
 	{
@@ -1456,8 +1456,7 @@ register_unlink(RelFileNodeBackend rnode)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+		mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST);
 	}
 	else
 	{
@@ -1476,7 +1475,7 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ * mdrequestsync() -- callback from checkpointer side of fsync request
  *
  * We stuff fsync requests into the local hash table for execution
  * during the checkpointer's next checkpoint.  UNLINK requests go into a
@@ -1497,7 +1496,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
 {
 	Assert(pendingOpsTable);
 
@@ -1640,7 +1639,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
@@ -1679,7 +1678,7 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342e..d0b2c0d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,6 +58,8 @@ typedef struct f_smgr
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
+	void		(*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum,
+									 int segno);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);	/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
@@ -81,15 +83,45 @@ static const f_smgr smgrsw[] = {
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
+		.smgr_requestsync = mdrequestsync,
 		.smgr_immedsync = mdimmedsync,
 		.smgr_pre_ckpt = mdpreckpt,
 		.smgr_sync = mdsync,
 		.smgr_post_ckpt = mdpostckpt
+	},
+	/* undo logs */
+	{
+		.smgr_init = undofile_init,
+		.smgr_shutdown = undofile_shutdown,
+		.smgr_close = undofile_close,
+		.smgr_create = undofile_create,
+		.smgr_exists = undofile_exists,
+		.smgr_unlink = undofile_unlink,
+		.smgr_extend = undofile_extend,
+		.smgr_prefetch = undofile_prefetch,
+		.smgr_read = undofile_read,
+		.smgr_write = undofile_write,
+		.smgr_writeback = undofile_writeback,
+		.smgr_nblocks = undofile_nblocks,
+		.smgr_truncate = undofile_truncate,
+		.smgr_requestsync = undofile_requestsync,
+		.smgr_immedsync = undofile_immedsync,
+		.smgr_pre_ckpt = undofile_preckpt,
+		.smgr_sync = undofile_sync,
+		.smgr_post_ckpt = undofile_postckpt
 	}
 };
 
 static const int NSmgr = lengthof(smgrsw);
 
+/*
+ * In ancient Postgres the catalog entry for each relation controlled the
+ * choice of storage manager implementation.  Now we have only md.c for
+ * regular relations, and undofile.c for undo log storage in the undolog
+ * pseudo-database.
+ */
+#define SmgrWhichForRelFileNode(rfn)			\
+	((rfn).dbNode == 9 ? 1 : 0)
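As an illustration of the RelFileNode layout that drives this dispatch (a sketch; the variable names are placeholders, but the field usage mirrors undofile_forgetsync() later in this patch):

	RelFileNode rnode;

	rnode.spcNode = tablespace;	/* tablespace holding the undo log */
	rnode.dbNode = 9;			/* the undolog pseudo-database */
	rnode.relNode = logno;		/* undo log number */

	/* smgropen() will now route this rnode to undofile.c rather than md.c. */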
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -185,11 +217,18 @@ smgropen(RelFileNode rnode, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		reln->smgr_fsm_nblocks = InvalidBlockNumber;
 		reln->smgr_vm_nblocks = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		/* Which storage manager implementation? */
+		reln->smgr_which = SmgrWhichForRelFileNode(rnode);
 
 		/* mark it not open */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
 			reln->md_num_open_segs[forknum] = 0;
+			reln->md_seg_fds[forknum] = NULL;
+		}
+
+		reln->private_data = NULL;
 
 		/* it has no owner yet */
 		add_to_unowned_list(reln);
@@ -723,6 +762,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 }
 
 /*
+ *	smgrrequestsync() -- Enqueue a request for smgrsync() to flush data.
+ */
+void
+smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno);
+}
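Presumably (the corresponding checkpointer.c hunk is not shown here), the checkpointer's absorb path now dispatches each dequeued fsync request through this function rather than calling RememberFsyncRequest() directly, so that undo requests reach undofile_requestsync() and everything else reaches mdrequestsync().  Roughly:

	/* In AbsorbFsyncRequests(), for each dequeued request (sketch only): */
	smgrrequestsync(request->rnode, request->forknum, request->segno);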
+
+/*
  *	smgrimmedsync() -- Force the specified relation to stable storage.
  *
  *		Synchronously force all previous writes to the specified relation
diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
new file mode 100644
index 0000000..afba64e
--- /dev/null
+++ b/src/backend/storage/smgr/undofile.c
@@ -0,0 +1,546 @@
+/*
+ * undofile.c
+ *
+ * PostgreSQL undo file manager.  This module provides an SMGR-compatible
+ * interface to the files that back undo logs on the filesystem, so that undo
+ * log data can use the shared buffer pool.  Other aspects of undo log
+ * management are provided by undolog.c, so the SMGR interfaces not directly
+ * concerned with reading, writing and flushing data are unimplemented.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/storage/smgr/undofile.c
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/fd.h"
+#include "storage/undofile.h"
+#include "utils/memutils.h"
+
+/* intervals for calling AbsorbFsyncRequests in undofile_sync */
+#define FSYNCS_PER_ABSORB		10
+
+/*
+ * Special values for the fork arg to undofile_requestsync.
+ */
+#define FORGET_UNDO_SEGMENT_FSYNC	(InvalidBlockNumber)
+
+/*
+ * While md.c expects random access and has a small number of huge
+ * segments, undofile.c manages a potentially very large number of smaller
+ * segments and has a less random access pattern.  Therefore, instead of
+ * keeping a potentially huge array of vfds we'll just keep the most
+ * recently accessed N.
+ *
+ * For now, N == 1, so we just need to hold onto one 'File' handle.
+ */
+typedef struct UndoFileState
+{
+	int		mru_segno;
+	File	mru_file;
+} UndoFileState;
+
+static MemoryContext UndoFileCxt;
+
+typedef uint16 CycleCtr;
+
+/*
+ * An entry recording the segments that need to be fsynced by undofile_sync().
+ * This is a bit simpler than md.c's version, though it could perhaps be
+ * merged into a common struct.  One difference is that we can have much
+ * larger segment numbers, so we'll adjust for that to avoid having a lot of
+ * leading zero bits.
+ */
+typedef struct
+{
+	RelFileNode rnode;
+	Bitmapset  *requests;
+	CycleCtr	cycle_ctr;
+} PendingOperationEntry;
+
+static HTAB *pendingOpsTable = NULL;
+static MemoryContext pendingOpsCxt;
+
+static CycleCtr undofile_sync_cycle_ctr = 0;
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok);
+static File undofile_get_segment_file(SMgrRelation reln, int segno);
+
+void
+undofile_init(void)
+{
+	UndoFileCxt = AllocSetContextCreate(TopMemoryContext,
+										"UndoFileSmgr",
+										ALLOCSET_DEFAULT_SIZES);
+
+	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	{
+		HASHCTL		hash_ctl;
+
+		pendingOpsCxt = AllocSetContextCreate(UndoFileCxt,
+											  "Pending ops context",
+											  ALLOCSET_DEFAULT_SIZES);
+		MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+		hash_ctl.keysize = sizeof(RelFileNode);
+		hash_ctl.entrysize = sizeof(PendingOperationEntry);
+		hash_ctl.hcxt = pendingOpsCxt;
+		pendingOpsTable = hash_create("Pending Ops Table",
+									  100L,
+									  &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+}
+
+void
+undofile_shutdown(void)
+{
+}
+
+void
+undofile_close(SMgrRelation reln, ForkNumber forknum)
+{
+}
+
+void
+undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_create is not supported");
+}
+
+bool
+undofile_exists(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_exists is not supported");
+}
+
+void
+undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_unlink is not supported");
+}
+
+void
+undofile_extend(SMgrRelation reln, ForkNumber forknum,
+				BlockNumber blocknum, char *buffer,
+				bool skipFsync)
+{
+	elog(ERROR, "undofile_extend is not supported");
+}
+
+void
+undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	elog(ERROR, "undofile_prefetch is not supported");
+}
+
+void
+undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			  char *buffer)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	nbytes = FileRead(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_READ);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+}
+
+static void
+register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file)
+{
+	/* Temp relations should never be fsync'd */
+	Assert(!SmgrIsTemp(reln));
+
+	if (pendingOpsTable)
+	{
+		/* push it into local pending-ops table */
+		undofile_requestsync(reln->smgr_rnode.node, forknum, segno);
+	}
+	else
+	{
+		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno))
+			return;				/* passed it off successfully */
+
+		ereport(DEBUG1,
+				(errmsg("could not forward fsync request because request queue is full")));
+
+		if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(file))));
+	}
+}
+
+void
+undofile_write(SMgrRelation reln, ForkNumber forknum,
+			   BlockNumber blocknum, char *buffer,
+			   bool skipFsync)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	nbytes = FileWrite(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_WRITE);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		/*
+		 * short write: unexpected, because this should be overwriting an
+		 * entirely pre-allocated segment file
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+
+	if (!skipFsync && !SmgrIsTemp(reln))
+		register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file);
+}
+
+void
+undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+				   BlockNumber blocknum, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		File	file;
+		int		nflush;
+
+		file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+
+		/* compute number of desired writes within the current segment */
+		nflush = Min(nblocks,
+					 UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE));
+
+		FileWriteback(file,
+					  (blocknum % UNDOSEG_SIZE) * BLCKSZ,
+					  nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH);
+
+		nblocks -= nflush;
+		blocknum += nflush;
+	}
+}
+
+BlockNumber
+undofile_nblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_nblocks is not supported");
+	return 0;
+}
+
+void
+undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
+{
+	elog(ERROR, "undofile_truncate is not supported");
+}
+
+void
+undofile_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_immedsync is not supported");
+}
+
+void
+undofile_preckpt(void)
+{
+}
+
+void
+undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+	PendingOperationEntry *entry;
+	bool		found;
+
+	Assert(pendingOpsTable);
+
+	if (forknum == FORGET_UNDO_SEGMENT_FSYNC)
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_FIND,
+													  NULL);
+		if (entry)
+			entry->requests = bms_del_member(entry->requests, segno);
+	}
+	else
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_ENTER,
+													  &found);
+		if (!found)
+		{
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+			entry->requests = bms_make_singleton(segno);
+		}
+		else
+			entry->requests = bms_add_member(entry->requests, segno);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+void
+undofile_forgetsync(Oid logno, Oid tablespace, int segno)
+{
+	RelFileNode rnode;
+
+	rnode.dbNode = 9;
+	rnode.spcNode = tablespace;
+	rnode.relNode = logno;
+
+	if (pendingOpsTable)
+		undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno);
+	else if (IsUnderPostmaster)
+	{
+		while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno))
+			pg_usleep(10000L);
+	}
+}
+
+void
+undofile_sync(void)
+{
+	static bool undofile_sync_in_progress = false;
+
+	HASH_SEQ_STATUS hstat;
+	PendingOperationEntry *entry;
+	int			absorb_counter;
+	int			segno;
+
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	AbsorbFsyncRequests();
+
+	if (undofile_sync_in_progress)
+	{
+		/* prior try failed, so update any stale cycle_ctr values */
+		hash_seq_init(&hstat, pendingOpsTable);
+		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+	}
+
+	undofile_sync_cycle_ctr++;
+	undofile_sync_in_progress = true;
+
+	absorb_counter = FSYNCS_PER_ABSORB;
+	hash_seq_init(&hstat, pendingOpsTable);
+	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+	{
+		Bitmapset	   *requests;
+
+		/* Skip entries that were added after this sync cycle began. */
+		if (entry->cycle_ctr == undofile_sync_cycle_ctr)
+			continue;
+
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr);
+
+		if (!enableFsync)
+			continue;
+
+		requests = entry->requests;
+		entry->requests = NULL;
+
+		segno = -1;
+		while ((segno = bms_next_member(requests, segno)) >= 0)
+		{
+			File		file;
+
+			if (!enableFsync)
+				continue;
+
+			file = undofile_open_segment_file(entry->rnode.relNode,
+											  entry->rnode.spcNode,
+											  segno, true /* missing_ok */);
+
+			/*
+			 * The file may be gone due to concurrent discard.  We'll ignore
+			 * that, but only if we find a cancel request for this segment in
+			 * the queue.
+			 *
+			 * It's also possible that we succeed in opening a segment file
+			 * that is subsequently recycled (renamed to represent a new range
+			 * of undo log), in which case we'll fsync that later file
+			 * instead.  That is rare and harmless.
+			 */
+			if (file <= 0)
+			{
+				char		name[MAXPGPATH];
+
+				/*
+				 * Put the request back into the bitset in a way that can't
+				 * fail due to memory allocation.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				/*
+				 * Check if a forgetsync request has arrived to delete that
+				 * segment.
+				 */
+				AbsorbFsyncRequests();
+				if (bms_is_member(segno, entry->requests))
+				{
+					UndoLogSegmentPath(entry->rnode.relNode,
+									   segno,
+									   entry->rnode.spcNode,
+									   name);
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\": %m", name)));
+				}
+				/* It must have been removed, so we can safely skip it. */
+				continue;
+			}
+
+			elog(LOG, "fsync()ing %s", FilePathName(file));	/* TODO: remove me */
+			if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0)
+			{
+				char		name[MAXPGPATH];
+
+				strcpy(name, FilePathName(file));
+				FileClose(file);
+
+				/*
+				 * Keep the failed requests, but merge with any new ones.  The
+				 * requirement to be able to do this without risk of failure
+				 * prevents us from using a smaller bitmap that doesn't bother
+				 * tracking leading zeros.  Perhaps another data structure
+				 * would be better.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not fsync file \"%s\": %m", name)));
+			}
+			requests = bms_del_member(requests, segno);
+			FileClose(file);
+
+			if (--absorb_counter <= 0)
+			{
+				AbsorbFsyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB;
+			}
+		}
+
+		bms_free(requests);
+	}
+
+	/* Reset the flag: this pass completed without error. */
+	undofile_sync_in_progress = false;
+}
+
+void
+undofile_postckpt(void)
+{
+}
+
+static File
+undofile_open_segment_file(Oid relNode, Oid spcNode, int segno, bool missing_ok)
+{
+	File		file;
+	char		path[MAXPGPATH];
+
+	UndoLogSegmentPath(relNode, segno, spcNode, path);
+	file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+	if (file <= 0 && (!missing_ok || errno != ENOENT))
+		elog(ERROR, "cannot open undo segment file '%s': %m", path);
+
+	return file;
+}
+
+/*
+ * Get a File for a particular segment of a SMgrRelation representing an undo
+ * log.
+ */
+static File
+undofile_get_segment_file(SMgrRelation reln, int segno)
+{
+	UndoFileState *state;
+
+	/*
+	 * Create private state space on demand.
+	 *
+	 * XXX There should probably be a smgr 'open' or 'init' interface that
+	 * would do this.  smgr.c currently initializes reln->md_XXX stuff
+	 * directly...
+	 */
+	state = (UndoFileState *) reln->private_data;
+	if (unlikely(state == NULL))
+	{
+		state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState));
+		reln->private_data = state;
+	}
+
+	/* If we have a file open already, check if we need to close it. */
+	if (state->mru_file > 0 && state->mru_segno != segno)
+	{
+		/* These are not the blocks we're looking for. */
+		FileClose(state->mru_file);
+		state->mru_file = 0;
+	}
+
+	/* Check if we need to open a new file. */
+	if (state->mru_file <= 0)
+	{
+		state->mru_file =
+			undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.node.spcNode,
+									   segno, InRecovery);
+		if (InRecovery && state->mru_file <= 0)
+		{
+			/*
+			 * If in recovery, we may be trying to access a file that will
+			 * later be unlinked.  Tolerate missing files, creating a new
+			 * zero-filled file as required.
+			 */
+			UndoLogNewSegment(reln->smgr_rnode.node.relNode,
+							  reln->smgr_rnode.node.spcNode,
+							  segno);
+			state->mru_file =
+				undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+										   reln->smgr_rnode.node.spcNode,
+										   segno, false);
+			Assert(state->mru_file > 0);
+		}
+		state->mru_segno = segno;
+	}
+
+	return state->mru_file;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d1..763379e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -624,6 +624,11 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter tuples_inserted;
 	PgStat_Counter tuples_updated;
 	PgStat_Counter tuples_deleted;
+
+	/*
+	 * For heap tables, tuples_hot_updated counts HOT updates; for zheap
+	 * tables, it counts in-place updates.
+	 */
 	PgStat_Counter tuples_hot_updated;
 
 	PgStat_Counter n_live_tuples;
@@ -743,6 +748,7 @@ typedef enum BackendState
 #define PG_WAIT_IPC					0x08000000U
 #define PG_WAIT_TIMEOUT				0x09000000U
 #define PG_WAIT_IO					0x0A000000U
+#define PG_WAIT_PAGE_TRANS_SLOT		0x0B000000U
 
 /* ----------
  * Wait Events - Activity
@@ -767,7 +773,7 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
 } WaitEventActivity;
 
 /* ----------
@@ -913,6 +919,13 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_READ,
+	WAIT_EVENT_UNDO_CHECKPOINT_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_SYNC,
+	WAIT_EVENT_UNDO_FILE_READ,
+	WAIT_EVENT_UNDO_FILE_WRITE,
+	WAIT_EVENT_UNDO_FILE_FLUSH,
+	WAIT_EVENT_UNDO_FILE_SYNC,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
@@ -1317,6 +1330,7 @@ pgstat_report_wait_end(void)
 
 extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_zheap_update(Relation rel);
 extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce390..5b13556 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -38,8 +38,9 @@ typedef enum BufferAccessStrategyType
 typedef enum
 {
 	RBM_NORMAL,					/* Normal read */
-	RBM_ZERO_AND_LOCK,			/* Don't read from disk, caller will
-								 * initialize. Also locks the page. */
+	RBM_ZERO,					/* Don't read from disk, caller will
+								 * initialize. */
+	RBM_ZERO_AND_LOCK,			/* Like RBM_ZERO, but also locks the page. */
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
@@ -171,7 +172,10 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 				   BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
-						  ReadBufferMode mode, BufferAccessStrategy strategy);
+						  ReadBufferMode mode, BufferAccessStrategy strategy,
+						  char relpersistence);
+extern void ForgetBuffer(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -228,6 +232,10 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+/* in localbuf.c */
+extern void ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum,
+				  BlockNumber blockNum);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc..65d164b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -71,6 +71,9 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
+	/* For use by implementations. */
+	void	   *private_data;
+
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	struct SMgrRelationData *next_unowned_reln;
 } SMgrRelationData;
@@ -105,6 +108,7 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
+extern void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
 extern void smgrsync(void);
@@ -133,14 +137,41 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
+extern void mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
+/* in undofile.c */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+		BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+			BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber nblocks);
+extern void undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
 extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/undofile.h b/src/include/storage/undofile.h
new file mode 100644
index 0000000..7544be3
--- /dev/null
+++ b/src/include/storage/undofile.h
@@ -0,0 +1,50 @@
+/*
+ * undofile.h
+ *
+ * PostgreSQL undo file manager.  This module manages the files that back undo
+ * logs on the filesystem.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/undofile.h
+ */
+
+#ifndef UNDOFILE_H
+#define UNDOFILE_H
+
+#include "storage/smgr.h"
+
+/* Prototypes of functions exposed to SMgr. */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber blocknum, char *buffer,
+							bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, char *buffer,
+						   bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber nblocks);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
+/* Functions used by undolog.c. */
+extern void undofile_forgetsync(Oid logno, Oid tablespace, int segno);
+
+#endif
-- 
1.8.3.1

0003-undo-interface-v10.patch (application/x-patch)
From 5a51ebb0ea23ab19e81aa415c07483bd5c1c5801 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 3 Dec 2018 10:33:50 +0530
Subject: [PATCH] undo-interface-v9

Provide an interface to prepare, insert, and fetch undo records.  This
layer uses the undo-log-storage layer to reserve space for the undo
records and the buffer management routines to write and read them.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
Reviewed by Amit Kapila.
---
 src/backend/access/transam/xact.c    |   28 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1193 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  451 +++++++++++++
 src/include/access/undoinsert.h      |  109 ++++
 src/include/access/undorecord.h      |  222 +++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2037 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400..6060013 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2657,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4842,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1064ee0..01815a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8323,6 +8323,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * The xid can be on either side of ckptXid when we are near wrap-around.
+	 * If it is numerically less than ckptXid but logically follows it, it
+	 * must have wrapped into the next epoch.  OTOH, if it is numerically
+	 * greater but logically precedes ckptXid, it belongs to the previous
+	 * epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
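As a worked example of the adjustment (32-bit xid arithmetic): with ckptXid = 100 and xid = 4294966296, the xid is numerically greater but TransactionIdPrecedes(xid, ckptXid) holds, so the xid belongs to the previous epoch and the counter is decremented.  Conversely, with ckptXid = 4294966296 and xid = 100, the xid is numerically smaller but logically follows ckptXid, so it has wrapped into the next epoch and the counter is incremented.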
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..7c7e4ff
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1193 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ *  Undo records are stored in sequential order in the undo log.  Each
+ *  transaction's first undo record (the "transaction header") points to the
+ *  next transaction's transaction header.  Transaction headers are linked so
+ *  that the discard worker can read the undo log transaction by transaction
+ *  without having to read every undo record.
+ *
+ * Handling multiple logs:
+ *
+ *  A single transaction's undo records can be spread across multiple undo
+ *  logs, and we need some special handling while inserting the undo so that
+ *  discard and rollback work sanely.
+ *
+ *  If an undo record goes to the next log, we insert a transaction header for
+ *  the first record in the new log and update the previous transaction header
+ *  with the new log's location.  This lets us connect the pieces of a
+ *  transaction that spans multiple logs (for this we keep track of the
+ *  previous logno in the undo log metadata), which is required to find the
+ *  latest undo record pointer of an aborted transaction so its undo actions
+ *  can be executed before discard.  If the next log gets processed first, we
+ *  don't need to trace back to the actual start pointer of the transaction;
+ *  in that case we can execute the undo actions from the current log only,
+ *  because the undo pointer in the slot will have been rewound, which is
+ *  enough to avoid executing the same actions twice.  However, it is possible
+ *  that after the undo actions have been executed, that undo gets discarded;
+ *  later, while processing the previous log, we might then try to fetch an
+ *  undo record in the discarded log while chasing the transaction header
+ *  chain.  To avoid this, we first check whether the transaction's next_urec
+ *  is already discarded, and if so we start executing from the last undo
+ *  record in the current log instead.
+ *
+ *  We only connect to the next log if the same transaction spread into the
+ *  next log; otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size greater than BLCKSZ?  If not,
+ * an undo record can spread across at most 2 buffers.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the default number of undo records that can be prepared
+ * before insert is called.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, you must call UndoSetPrepareSize first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new top-level
+ * transaction tries to prepare an undo record, we check whether its txid
+ * differs from prev_txid; if so, we insert a transaction start header.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+static bool update_prev_header = false;
+
+/*
+ * By default, prepared_undo and undo_buffer point to static memory.  If the
+ * caller needs to prepare more than the default number of undo records, the
+ * limit can be raised by calling UndoSetPrepareSize.  In that case dynamic
+ * memory is allocated, and prepared_undo and undo_buffer point to the newly
+ * allocated memory, which is released by UnlockReleaseUndoBuffers; these
+ * variables are then reset to their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating the next undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record has been discarded.  Returns false if it has
+ * already been discarded, otherwise true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock if
+ * it returns false; otherwise the lock is still held on return and the caller
+ * must release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on it to determine
+		 * whether this undo record pointer has already been discarded;
+		 * instead we check by calling the undo log routine.  If it is not yet
+		 * discarded, we reacquire log->discard_lock so that it cannot be
+		 * discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo inside the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		xact_urp = InvalidUndoRecPtr;
+	else
+		xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * The absence of a previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case we have nothing to
+	 * update.
+	 * UndoRecordIsValid will release the lock if it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the header
+	 * is split across buffers, we read the complete header by invoking
+	 * UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info.idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info.uur.uur_next = urecptr;
+	xact_urec_info.urecptr = xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just overwrites the record already prepared by
+ * UndoRecordPrepareTransInfo, and must be called inside a critical section.
+ * Only the undo record header is overwritten, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info.urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info.urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info.idx_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present, just
+ * return its index; otherwise read the buffer, insert an entry for it, and
+ * lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to compare just the block number, because
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare (logno, blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate the space in
+ * bulk.  This is required for operations that can generate multiple undo
+ * records under one WAL record, e.g. multi-insert.  If we didn't allocate
+ * undo space for all the records (which are inserted under one WAL record)
+ * together, some of them could end up in different undo logs, and currently
+ * during recovery we have no mechanism to map an xid to multiple log numbers
+ * within one WAL operation.  In short, all undo allocated under one WAL
+ * record must come from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.  XXX alternatively, instead of adding this information
+		 * to the first record, we could prepare a separate record that
+		 * contains only the transaction information.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values as
+			 * these are used only with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * We can consider the log as switched if this is the first record of the
+	 * log and not the first record of the transaction, i.e. the same
+	 * transaction continued from the previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set how many undo records can be prepared before
+ * we insert them.  If the count is greater than MAX_PREPARED_UNDO, extra
+ * memory is allocated to hold the additional prepared undo records.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's starting
+	 * undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL; otherwise,
+ * it refers to the top transaction id, because the undo log only stores a
+ * mapping for top-level transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id, because the undo log only stores a
+		 * mapping for top-level transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* advance the prepared ptr location for next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+	/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store, except when
+		 * the same transaction is continuing from the previous undo log; see
+		 * the detailed comment atop this file.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/*
+		 * If starting from a new page then include the block header size in
+		 * the prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we write the first record into a
+			 * page.  We start writing immediately after the block header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Link the transactions in the same log so that we can discard all
+		 * the transaction's undo log in one-shot.
+		 */
+		if (UndoRecPtrIsValid(xact_urec_info.urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required on
+ * transaction abort and when releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	xact_urec_info.urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * max_prepared_undo limit is changed so free the allocated memory and
+	 * reset all the variable back to their default value.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks it into urec.  This function does not release the pin on
+ * the buffer if the complete record is fetched from a single buffer, so the
+ * caller can reuse the same urec to fetch another undo record from the same
+ * block.  The caller is responsible for releasing the buffer inside urec and
+ * setting it to invalid if it wishes to fetch a record from another block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This can be optimized to just fetch header first and only if
+		 * matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer, otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned then keep it only if we want to find
+	 * the next tuple in the same block.  Otherwise release the buffer and
+	 * set it to invalid.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is no valid buffer in urec->uur_buffer, that means we
+		 * copied the payload data and tuple data, so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * If a valid pointer is passed, *urec_ptr_out is set to the undo record
+ * pointer of the qualifying undo record.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's condition.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl *prevlog,
+				   *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr(logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * Called whenever we attach to a new undo log, so that we forget about our
+ * translation-unit private state relating to the log we were last attached
+ * to.
+ */
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+	prev_txid[persistence] = InvalidTransactionId;
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set LSN on undo page.
+*/
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..73076dc
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,451 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin
+ * writing, while *already_written is the number of bytes written to
+ * previous pages.  Returns true if the remainder of the record was
+ * written and false if more bytes remain to be written; in either
+ * case, *already_written is set to the number of bytes written thus
+ * far.
+ *
+ * This function assumes that if *already_written is non-zero on entry,
+ * the same UnpackedUndoRecord is passed each time.  It also assumes
+ * that UnpackUndoRecord is not called between successive calls to
+ * InsertUndoRecord for the same UnpackedUndoRecord.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX There is a possible optimization here: instead of always
+		 * allocating memory when the record is split, we could check whether
+		 * the payload or the tuple data falls entirely within one page and
+		 * avoid allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true then the readlen bytes are skipped over in the undo
+ * data rather than being copied into the destination buffer.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..fe4a97e
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,109 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+
+/*
+ * Unlock and release undo buffers.  This step is performed after exiting any
+ * critical section where we have prepared the undo record.
+ */
+extern void UnlockReleaseUndoBuffers(void);
+
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);
+
+/*
+ * Fetch the next undo record for given blkno and offset.  Start the search
+ * from urp.  The caller needs to call UndoRecordRelease to release the
+ * resources allocated by this function.
+ */
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno,
+				OffsetNumber offset,
+				TransactionId xid,
+				UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+
+/*
+ * Set the value of PrevUndoLen.
+ */
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+
+/*
+ * Call UndoSetPrepareSize to set the maximum number of undo records that can
+ * be prepared before they are inserted.  If the number is greater than
+ * MAX_PREPARED_UNDO then extra memory is allocated to hold the extra
+ * prepared undo records.
+ */
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+
+/*
+ * return the previous undo record pointer.
+ */
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+
+extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
+
+/* Reset globals related to undo buffers */
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..9ca2455
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,222 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without padding bytes, and
+ * the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older then RecentGlobalXmin, then we can consider the tuple
+	 * in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+#define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	uint32		urec_progress;	/* undo applying progress. */
+	uint32		urec_xidepoch;	/* epoch of the current transaction */
+	Oid			urec_dbid;		/* database id */
+	uint64		urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+#define urec_next_pos \
+	(SizeOfUndoRecordTransaction - SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordExpectedSize or InsertUndoRecord.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/*
+	 * Undo action apply progress: 0 = not started, 1 = completed.  In future
+	 * it could also be used to show how much of the undo has been applied so
+	 * far, but currently only 0 and 1 are used.
+	 */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.  For the first call, the given page should be the one which
+ * the caller has determined to contain the current insertion point,
+ * starting_byte should be the byte offset within that page which corresponds
+ * to the current insertion point, and *already_written should be 0.  The
+ * return value will be true if the entire record is successfully written
+ * into that page, and false if not.  In either case, *already_written will
+ * be updated to the number of bytes written by all InsertUndoRecord calls
+ * for this record to date.  If this function is called again to continue
+ * writing the record, the previous value for *already_written should be
+ * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
+ * (since the record will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f3a7ba4..d4e742f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#26Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#25)
1 attachment(s)
Re: Undo logs

On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+ urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+ else
+ urecptr = prepared_urec_ptr;
+
+ size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof. As it is
written, there is no guarantee that the size we have allocated is the same
as the size we are using in this function.

I agree
How about if we take 'size' as output

parameter from UndoRecordAllocate and then use it in this function?
Additionally, we can have an Assert that the size returned by
UndoRecordAllocate is same as UndoRecordExpectedSize.

With this change we will be able to guarantee it when we are allocating a
single undo record, but multi-prepare will still be a problem. I haven't
fixed this as of now. I will think about how to handle both cases: when we
have to prepare one time, and when we have to allocate once and prepare
multiple times.

Yeah, this appears tricky. I think we can leave it as it is unless we
get some better idea.
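
Just to illustrate the shape of that idea for the single-record case, the
caller side could look something like the sketch below. This is only an
illustration, not code from any of the attached patches; the alloc_size
output parameter is hypothetical.

	Size	alloc_size = 0;		/* hypothetical out-parameter */

	if (!UndoRecPtrIsValid(prepared_urec_ptr))
	{
		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence, &alloc_size);

		/* The space just reserved must match the space we are about to use. */
		Assert(alloc_size == UndoRecordExpectedSize(urec));
	}
	else
		urecptr = prepared_urec_ptr;

	size = UndoRecordExpectedSize(urec);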

1.
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+ prev_txid[persistence] = InvalidTransactionId;
+}

Are we using this API in this or any other patch or in zheap? I see
this can be a useful API, but unless we have a plan to use it, I don't
think we should retain it.

2.
+ * Handling multi log:
+ *
+ *  It is possible that the undo record of a transaction can be spread across
+ *  multiple undo log.  And, we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.
+ *
+ *  If the undorecord goes to next log then we insert a transaction header for
+ *  the first record in the new log and update the transaction header with this
+ *  new log's location. This will allow us to connect transactions across logs
+ *  when the same transaction span across log (for this we keep track of the
+ *  previous logno in undo log meta) which is required to find the latest undo
+ *  record pointer of the aborted transaction for executing the undo actions
+ *  before discard. If the next log get processed first in that case we
+ *  don't need to trace back the actual start pointer of the transaction,
+ *  in such case we can only execute the undo actions from the current log
+ *  because the undo pointer in the slot will be rewound and that
will be enough
+ *  to avoid executing same actions.  However, there is possibility that after
+ *  executing the undo actions the undo pointer got discarded, now in later
+ *  stage while processing the previous log it might try to fetch the undo
+ *  record in the discarded log while chasing the transaction header chain.
+ *  To avoid this situation we first check if the next_urec of the transaction
+ *  is already discarded then no need to access that and start executing from
+ *  the last undo record in the current log.

I think I see a problem in the discard mechanism when the undo is spread
across multiple logs. Basically, if the second log contains undo of some
transaction prior to the transaction which has just decided to spread its
undo into the chosen undo log, then we might discard the undo of some
transaction(s) inadvertently. Am I missing something? If not, then I guess
we need to ensure that we don't immediately discard the undo in the second
log when a single transaction's undo is spread across two logs.

Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?

One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+    TransactionId txid, UndoPersistence upersistence)
{
..
..
+ if (InRecovery)
+ urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+ else
+ urecptr = UndoLogAllocate(size, upersistence);
+
+ log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+ /*
+ * By now, we must be attached to some undo log unless we are in recovery.
+ */
+ Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+ /*
+ * We can consider the log as switched if this is the first record of the
+ * log and not the first record of the transaction i.e. same transaction
+ * continued from the previous log.
+ */
+ if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+ log->meta.prevlogno != InvalidUndoLogNumber)
+ log_switched = true;
..
..
}

Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log? What guarantees that?

3.
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+   UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);

This text is a duplicate of what is mentioned in the .c file, so I have
removed it in the delta patch. Similarly, I have removed duplicate text
atop other functions exposed via undorecord.h.
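
For readers following the thread, the calling sequence these comments
describe looks roughly like the sketch below. The wrapper function name is
made up and the WAL-record assembly is elided; it is only meant to show the
ordering of the calls.

	static void
	undo_insert_example(UnpackedUndoRecord *undorecord, UndoPersistence persistence)
	{
		UndoRecPtr	urecptr;

		/*
		 * Pin and lock the undo buffers the record will go into.  This can
		 * fail, so it must happen before the critical section.  Passing
		 * InvalidTransactionId makes the undo layer use the top transaction id.
		 */
		urecptr = PrepareUndoInsert(undorecord, InvalidTransactionId, persistence);

		START_CRIT_SECTION();

		/* Copy the prepared record into the already pinned and locked buffers. */
		InsertPreparedUndo();

		/*
		 * A real caller would assemble its WAL record here, calling
		 * RegisterUndoLogBuffers() while registering buffers and
		 * UndoLogBuffersSetLSN() with the LSN returned by XLogInsert().
		 */

		END_CRIT_SECTION();

		/* Drop the pins and locks taken by PrepareUndoInsert. */
		UnlockReleaseUndoBuffers();

		(void) urecptr;		/* callers would normally record this somewhere */
	}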

4.
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);

This API is nowhere defined or used. What is the intention?

5.
+typedef struct UndoRecordHeader
+{
..
+ /*
+ * Transaction id that has modified the tuple present in this undo record.
+ * If this is older then RecentGlobalXmin, then we can consider the tuple
+ * in this undo record as visible.
+ */
+ TransactionId urec_prevxid;
..

/then/than

I think we need to mention oldestXidWithEpochHavingUndo instead of
RecentGlobalXmin.

6.
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS 0x01

I have expanded these comments in the attached delta patch. I think we
should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
patch as this is zheap specific and should be added later along with
the zheap code.

7.
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+ ForkNumber urec_fork; /* fork number */
+} UndoRecordRelationDetails;

This comment seems to be out-dated, so I have modified it in the attached delta patch.

8.
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress; /* undo applying progress. */
+ uint32 urec_xidepoch; /* epoch of the current transaction */

Can you expand comments about how the progress is defined and used?
Also, write a few sentences about why the epoch is captured and/or how it is used?

9.
+#define urec_next_pos \
+ (SizeOfUndoRecordTransaction - SizeOfUrecNext)

What is its purpose?

10.
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);

Again, I see duplicate text in the .h and .c files, so I removed this and
similar comments from the .h files. I have tried to move some parts of the
comments from the .h to the .c file, so that they are easier to read in one
place rather than having to refer to two places. See if I have missed
anything.
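
As an aside, the multi-call convention those header comments describe for
records that span pages amounts to a loop like the following on the caller's
side. This is just an illustrative sketch: uur is the record being written,
and the page-advance helper and starting offsets are placeholders, not
functions from the patch.

	int		already_written = 0;
	int		starting_byte = insertion_offset;	/* placeholder: byte offset of the insertion point */
	Page	page = insertion_page;				/* placeholder: page containing the insertion point */

	for (;;)
	{
		/* Write as much of the record as fits; true means the record is complete. */
		if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
			break;

		/* The record continues on the next page, immediately after its header. */
		page = next_undo_page();				/* placeholder helper */
		starting_byte = UndoLogBlockHeaderSize;
	}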

Apart from the above, I have made a few other cosmetic changes in the
attached delta patch; see if you like those and, if so, kindly include them
in the main patch.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v10-delta-amit.patchapplication/octet-stream; name=0003-undo-interface-v10-delta-amit.patchDownload
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
index 7c7e4ff1a1..3d57109e04 100644
--- a/src/backend/access/undo/undoinsert.c
+++ b/src/backend/access/undo/undoinsert.c
@@ -11,38 +11,41 @@
  * NOTES:
  * Undo record layout:
  *
- *  Undo record are stored in sequential order in the undo log.  And, each
- *  transaction's first undo record a.k.a. transaction header points to the next
- *  transaction's start header.  Transaction headers are linked so that the
- *  discard worker can read undo log transaction by transaction and avoid
- *  reading each undo record.
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's log at one-shot rather
+ * than record-by-record.  The callers are not aware of transaction header,
+ * this is entirely maintained and used by undo record layer.   See
+ * undorecord.h for detailed information about undo record header.
  *
  * Handling multi log:
  *
- *  It is possible that the undo record of a transaction can be spread across
- *  multiple undo log.  And, we need some special handling while inserting the
- *  undo for discard and rollback to work sanely.
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log.  And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
  *
- *  If the undorecord goes to next log then we insert a transaction header for
- *  the first record in the new log and update the transaction header with this
- *  new log's location. This will allow us to connect transactions across logs
- *  when the same transaction span across log (for this we keep track of the
- *  previous logno in undo log meta) which is required to find the latest undo
- *  record pointer of the aborted transaction for executing the undo actions
- *  before discard. If the next log get processed first in that case we
- *  don't need to trace back the actual start pointer of the transaction,
- *  in such case we can only execute the undo actions from the current log
- *  because the undo pointer in the slot will be rewound and that will be enough
- *  to avoid executing same actions.  However, there is possibility that after
- *  executing the undo actions the undo pointer got discarded, now in later
- *  stage while processing the previous log it might try to fetch the undo
- *  record in the discarded log while chasing the transaction header chain.
- *  To avoid this situation we first check if the next_urec of the transaction
- *  is already discarded then no need to access that and start executing from
- *  the last undo record in the current log.
+ * If the undorecord goes to next log then we insert a transaction header for
+ * the first record in the new log and update the transaction header with this
+ * new log's location. This will allow us to connect transactions across logs
+ * when the same transaction span across log (for this we keep track of the
+ * previous logno in undo log meta) which is required to find the latest undo
+ * record pointer of the aborted transaction for executing the undo actions
+ * before discard. If the next log get processed first in that case we
+ * don't need to trace back the actual start pointer of the transaction,
+ * in such case we can only execute the undo actions from the current log
+ * because the undo pointer in the slot will be rewound and that will be enough
+ * to avoid executing same actions.  However, there is possibility that after
+ * executing the undo actions the undo pointer got discarded, now in later
+ * stage while processing the previous log it might try to fetch the undo
+ * record in the discarded log while chasing the transaction header chain.
+ * To avoid this situation we first check if the next_urec of the transaction
+ * is already discarded then no need to access that and start executing from
+ * the last undo record in the current log.
  *
- *  We only connect to next log if the same transaction spread to next log
- *  otherwise don't.
+ * We only connect to next log if the same transaction spread to next log
+ * otherwise don't.
  *-------------------------------------------------------------------------
  */
 
@@ -464,9 +467,12 @@ resize:
 
 		/*
 		 * Prepare the transacion header for the first undo record of
-		 * transaction. XXX there is also an option that instead of adding the
+		 * transaction.
+		 *
+		 * XXX There is also an option that instead of adding the
 		 * information to this record we can prepare a new record which only
-		 * contain transaction informations.
+		 * contain transaction informations, but we can't see any clear
+		 * advantage of the same.
 		 */
 		if (need_xact_hdr && i == 0)
 		{
@@ -837,57 +843,6 @@ InsertPreparedUndo(void)
 	}
 }
 
-/*
- * Reset the global variables related to undo buffers. This is required at the
- * transaction abort and while releasing the undo buffers.
- */
-void
-ResetUndoBuffers(void)
-{
-	int			i;
-
-	for (i = 0; i < buffer_idx; i++)
-	{
-		undo_buffer[i].blk = InvalidBlockNumber;
-		undo_buffer[i].buf = InvalidBuffer;
-	}
-
-	xact_urec_info.urecptr = InvalidUndoRecPtr;
-
-	/* Reset the prepared index. */
-	prepare_idx = 0;
-	buffer_idx = 0;
-	prepared_urec_ptr = InvalidUndoRecPtr;
-
-	/*
-	 * max_prepared_undo limit is changed so free the allocated memory and
-	 * reset all the variable back to their default value.
-	 */
-	if (max_prepared_undo > MAX_PREPARED_UNDO)
-	{
-		pfree(undo_buffer);
-		pfree(prepared_undo);
-		undo_buffer = def_buffers;
-		prepared_undo = def_prepared;
-		max_prepared_undo = MAX_PREPARED_UNDO;
-	}
-}
-
-/*
- * Unlock and release the undo buffers.  This step must be performed after
- * exiting any critical section where we have perfomed undo actions.
- */
-void
-UnlockReleaseUndoBuffers(void)
-{
-	int			i;
-
-	for (i = 0; i < buffer_idx; i++)
-		UnlockReleaseBuffer(undo_buffer[i].buf);
-
-	ResetUndoBuffers();
-}
-
 /*
  * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
  * by urp and unpack the record into urec.  This function will not release the
@@ -1022,6 +977,10 @@ ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
  *
  * callback function decides whether particular undo record satisfies the
  * condition of caller.
+ *
+ * Returns the required undo record if found, otherwise, return NULL which
+ * means either the record is already discarded or there is no such record
+ * in the undo chain.
  */
 UnpackedUndoRecord *
 UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
@@ -1191,3 +1150,54 @@ UndoLogBuffersSetLSN(XLogRecPtr recptr)
 	for (idx = 0; idx < buffer_idx; idx++)
 		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
 }
+
+/*
+ * Reset the global variables related to undo buffers. This is required at the
+ * transaction abort and while releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	xact_urec_info.urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * max_prepared_undo limit is changed so free the allocated memory and
+	 * reset all the variable back to their default value.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have perfomed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
index 73076dc5f4..b44c41d7f5 100644
--- a/src/backend/access/undo/undorecord.c
+++ b/src/backend/access/undo/undorecord.c
@@ -58,18 +58,28 @@ UndoRecordExpectedSize(UnpackedUndoRecord *uur)
 }
 
 /*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
  * Insert as much of an undo record as will fit in the given page.
- * starting_byte is the byte within the give page at which to begin
- * writing, while *already_written is the number of bytes written to
- * previous pages.  Returns true if the remainder of the record was
- * written and false if more bytes remain to be written; in either
- * case, *already_written is set to the number of bytes written thus
- * far.
+ * starting_byte is the byte within the give page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
  *
- * This function assumes that if *already_written is non-zero on entry,
- * the same UnpackedUndoRecord is passed each time.  It also assumes
- * that UnpackUndoRecord is not called between successive calls to
- * InsertUndoRecord for the same UnpackedUndoRecord.
+ * This function sets uur->uur_info as a side effect.
  */
 bool
 InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
index fe4a97ec2e..93b6410368 100644
--- a/src/include/access/undoinsert.h
+++ b/src/include/access/undoinsert.h
@@ -28,35 +28,12 @@ typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
 										   OffsetNumber offset,
 										   TransactionId xid);
 
-/*
- * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
- * intended to insert.  Upon return, the necessary undo buffers are pinned and
- * locked.
- * This should be done before any critical section is established, since it
- * can fail.
- *
- * If not in recovery, 'xid' should refer to the top transaction id because
- * undo log only stores mapping for the top most transactions.
- * If in recovery, 'xid' refers to the transaction id stored in WAL.
- */
 extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
 				  UndoPersistence);
-
-/*
- * Insert a previously-prepared undo record.  This will write the actual undo
- * record into the buffers already pinned and locked in PreparedUndoInsert,
- * and mark them dirty.  For persistent undo, this step should be performed
- * after entering a critical section; it should never fail.
- */
 extern void InsertPreparedUndo(void);
 
 extern void RegisterUndoLogBuffers(uint8 first_block_id);
 extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
-
-/*
- * Unlock and release undo buffers.  This step performed after exiting any
- * critical section where we have prepared the undo record.
- */
 extern void UnlockReleaseUndoBuffers(void);
 
 /*
@@ -65,45 +42,16 @@ extern void UnlockReleaseUndoBuffers(void);
  * inserting undo after having prepared a record for insertion.
  */
 extern void CancelPreparedUndo(void);
-
-/*
- * Fetch the next undo record for given blkno and offset.  Start the search
- * from urp.  Caller need to call UndoRecordRelease to release the resources
- * allocated by this function.
- */
 extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
-				BlockNumber blkno,
-				OffsetNumber offset,
-				TransactionId xid,
-				UndoRecPtr *urec_ptr_out,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
 				SatisfyUndoRecordCallback callback);
-
-/*
- * Release the resources allocated by UndoFetchRecord.
- */
 extern void UndoRecordRelease(UnpackedUndoRecord *urec);
-
-/*
- * Set the value of PrevUndoLen.
- */
 extern void UndoRecordSetPrevUndoLen(uint16 len);
-
-/*
- * Call UndoSetPrepareSize to set the value of how many maximum prepared can
- * be done before inserting the prepared undo.  If size is > MAX_PREPARED_UNDO
- * then it will allocate extra memory to hold the extra prepared undo.
- */
 extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
 				   TransactionId xid, UndoPersistence upersistence);
-
-/*
- * return the previous undo record pointer.
- */
 extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
-
 extern void UndoRecordOnUndoLogChange(UndoPersistence persistence);
-
-/* Reset globals related to undo buffers */
 extern void ResetUndoBuffers(void);
 
 #endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
index 9ca245509c..04de50e112 100644
--- a/src/include/access/undorecord.h
+++ b/src/include/access/undorecord.h
@@ -37,8 +37,8 @@ typedef struct UndoRecordHeader
 
 	/*
 	 * Transaction id that has modified the tuple present in this undo record.
-	 * If this is older then RecentGlobalXmin, then we can consider the tuple
-	 * in this undo record as visible.
+	 * If this is older then oldestXidWithEpochHavingUndo, then we can consider
+	 * the tuple in this undo record as visible.
 	 */
 	TransactionId urec_prevxid;
 
@@ -60,8 +60,16 @@ typedef struct UndoRecordHeader
  *
  * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
  *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
  * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
  *
+ * If UREC_INFO_PAYLOAD_CONTAINS_SLOT is set, payload contains an additional
+ * information about transaction slot.  This is specific to zheap, we might
+ * want to invent a special way to encode different information in payload
+ * structure, but for now this seems easiest.
+ *
  * When (as will often be the case) multiple structures are present, they
  * appear in the same order in which the constants are defined here.  That is,
  * UndoRecordRelationDetails appears first.
@@ -71,10 +79,10 @@ typedef struct UndoRecordHeader
 #define UREC_INFO_PAYLOAD					0x04
 #define UREC_INFO_TRANSACTION				0x08
 #define UREC_INFO_PAYLOAD_CONTAINS_SLOT		0x10
+
 /*
  * Additional information about a relation to which this record pertains,
- * namely the tablespace OID and fork number.  If the tablespace OID is
- * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
  * can (and should) be omitted.
  */
 typedef struct UndoRecordRelationDetails
@@ -141,7 +149,8 @@ typedef struct UndoRecordPayload
  *
  * When creating an undo record from an UnpackedUndoRecord, caller should
  * set uur_info to 0.  It will be initialized by the first call to
- * UndoRecordExpectedSize or InsertUndoRecord.
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction specific header information.
  *
  * When an undo record is decoded into an UnpackedUndoRecord, all fields
  * will be initialized, but those for which no information is available
@@ -166,56 +175,20 @@ typedef struct UnpackedUndoRecord
 	Oid			uur_dbid;		/* database id */
 
 	/*
-	 * undo action apply progress 0 = not started, 1 = completed. In future it
-	 * can also be used to show the progress of how much undo has been applied
-	 * so far with some formulae but currently only 0 and 1 is used.
+	 * This indicates undo action apply progress, 0 means not started, 1 means
+	 * completed.  In future, it can also be used to show the progress of how
+	 * much undo has been applied so far with some formula.
 	 */
 	uint32		uur_progress;
 	StringInfoData uur_payload; /* payload bytes */
 	StringInfoData uur_tuple;	/* tuple bytes */
 } UnpackedUndoRecord;
 
-/*
- * Set uur_info for an UnpackedUndoRecord appropriately based on which
- * other fields are set.
- */
-extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
 
-/*
- * Compute the number of bytes of storage that will be required to insert
- * an undo record.  Sets uur->uur_info as a side effect.
- */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
 extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
-
-/*
- * To insert an undo record, call InsertUndoRecord() repeatedly until it
- * returns true.  For the first call, the given page should be the one which
- * the caller has determined to contain the current insertion point,
- * starting_byte should be the byte offset within that page which corresponds
- * to the current insertion point, and *already_written should be 0.  The
- * return value will be true if the entire record is successfully written
- * into that page, and false if not.  In either case, *already_written will
- * be updated to the number of bytes written by all InsertUndoRecord calls
- * for this record to date.  If this function is called again to continue
- * writing the record, the previous value for *already_written should be
- * passed again, and starting_byte should be passed as sizeof(PageHeaderData)
- * (since the record will continue immediately following the page header).
- *
- * This function sets uur->uur_info as a side effect.
- */
 extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
 				 int starting_byte, int *already_written, bool header_only);
-
-/*
- * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
- * the first call, starting_byte should be set to the beginning of the undo
- * record within the specified page, and *already_decoded should be set to 0;
- * the function will update it based on the number of bytes decoded.  The
- * return value is true if the entire record was unpacked and false if the
- * record continues on the next page.  In the latter case, the function
- * should be called again with the next page, passing starting_byte as the
- * sizeof(PageHeaderData).
- */
 extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
 				 int starting_byte, int *already_decoded, bool header_only);
 
#27Dilip Kumar
dilip.kumar@enterprisedb.com
In reply to: Amit Kapila (#26)
1 attachment(s)
Re: Undo logs

On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Tue, Dec 4, 2018 at 3:00 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Sat, Dec 1, 2018 at 12:58 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

13.
PrepareUndoInsert()
{
..
if (!UndoRecPtrIsValid(prepared_urec_ptr))
+ urecptr = UndoRecordAllocate(urec, 1, upersistence, txid);
+ else
+ urecptr = prepared_urec_ptr;
+
+ size = UndoRecordExpectedSize(urec);
..

I think we should make the above code a bit more bulletproof. As it is
written, there is no guarantee that the size we have allocated is the same
as what we are using in this function.

I agree
How about if we take 'size' as output parameter from UndoRecordAllocate and then use it in this function?
Additionally, we can have an Assert that the size returned by
UndoRecordAllocate is same as UndoRecordExpectedSize.

With this change we will be able to guarantee this when we are allocating a
single undo record, but multi-prepare will still be a problem.  I haven't
fixed this as of now.  I will think about how to handle both cases: when we
prepare one record at a time and when we allocate once and prepare multiple
times.

Yeah, this appears tricky. I think we can leave it as it is unless we
get some better idea.
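
Just to make the idea concrete, a minimal sketch of the single-record case,
assuming UndoRecordAllocate grew a 'size' out-parameter (no such parameter
exists in the posted patches):

    Size        alloc_size;
    UndoRecPtr  urecptr;

    /* Hypothetical: also report how many bytes were actually reserved. */
    urecptr = UndoRecordAllocate(urec, 1, txid, upersistence, &alloc_size);

    /* The reserved space must match what PrepareUndoInsert will consume. */
    Assert(alloc_size == UndoRecordExpectedSize(urec));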

1.
+void
+UndoRecordOnUndoLogChange(UndoPersistence persistence)
+{
+ prev_txid[persistence] = InvalidTransactionId;
+}

Are we using this API in this or any other patch or in zheap? I see
this can be a useful API, but unless we have a plan to use it, I don't
think we should retain it.

Currently, we are not using it so removed.

2.

+ * Handling multi log:
+ *
+ *  It is possible that the undo record of a transaction can be spread across
+ *  multiple undo log.  And, we need some special handling while inserting the
+ *  undo for discard and rollback to work sanely.
+ *
+ *  If the undorecord goes to next log then we insert a transaction header for
+ *  the first record in the new log and update the transaction header with this
+ *  new log's location. This will allow us to connect transactions across logs
+ *  when the same transaction span across log (for this we keep track of the
+ *  previous logno in undo log meta) which is required to find the latest undo
+ *  record pointer of the aborted transaction for executing the undo actions
+ *  before discard. If the next log get processed first in that case we
+ *  don't need to trace back the actual start pointer of the transaction,
+ *  in such case we can only execute the undo actions from the current log
+ *  because the undo pointer in the slot will be rewound and that will be enough
+ *  to avoid executing same actions.  However, there is possibility that after
+ *  executing the undo actions the undo pointer got discarded, now in later
+ *  stage while processing the previous log it might try to fetch the undo
+ *  record in the discarded log while chasing the transaction header chain.
+ *  To avoid this situation we first check if the next_urec of the transaction
+ *  is already discarded then no need to access that and start executing from
+ *  the last undo record in the current log.

I think I see the problem in the discard mechanism when the log is
spread across multiple logs. Basically, if the second log contains
undo of some transaction prior to the transaction which has just
decided to spread its undo in the chosen undo log, then we might
discard the undo log of some transaction(s) inadvertently. Am I
missing something?

Actually, I don't see exactly this problem here because we only process one
undo log at a time, so we will not go to the next undo log and discard
some transaction for which we are supposed to retain the undo.

If not, then I guess we need to ensure that we
don't immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.

Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?

One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+    TransactionId txid, UndoPersistence upersistence)
{
..
..
+ if (InRecovery)
+ urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+ else
+ urecptr = UndoLogAllocate(size, upersistence);
+
+ log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+ /*
+ * By now, we must be attached to some undo log unless we are in recovery.
+ */
+ Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+ /*
+ * We can consider the log as switched if this is the first record of the
+ * log and not the first record of the transaction i.e. same transaction
+ * continued from the previous log.
+ */
+ if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+ log->meta.prevlogno != InvalidUndoLogNumber)
+ log_switched = true;
..
..
}

Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log? What is the guarantee of same?

Yeah, it's a problem: we might get an undo log which is not empty. One way
to avoid this could be that, instead of relying on the check
"UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we add some flag
to the undo log meta data indicating whether it's the first record after
attach or not, and decide based on that. But I want to think of some better
solution where we can identify this without adding anything extra to the
undo meta.
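
For illustration only, that suggestion might end up looking roughly like the
sketch below; the 'first_rec_after_attach' field is invented here and does
not exist in any of the posted patches:

    /* Hypothetical replacement for the offset-based log-switch test. */
    if (log->meta.first_rec_after_attach &&
        log->meta.prevlogno != InvalidUndoLogNumber)
        log_switched = true;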

3.

+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * If not in recovery, 'xid' should refer to the top transaction id because
+ * undo log only stores mapping for the top most transactions.
+ * If in recovery, 'xid' refers to the transaction id stored in WAL.
+ */
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+   UndoPersistence);
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  For persistent undo, this step should be performed
+ * after entering a critical section; it should never fail.
+ */
+extern void InsertPreparedUndo(void);

This text is a duplicate of what is mentioned in the .c file, so I have
removed it in the delta patch. Similarly, I have removed duplicate text
atop other functions exposed via undorecord.h.
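
To make the intended calling pattern concrete, here is a rough sketch of the
sequence those comments describe (the persistence value, the WAL call and the
surrounding declarations are assumptions, not taken from the patch):

    UnpackedUndoRecord undorecord;  /* assumed to be filled in by the AM */
    UndoRecPtr  urecptr;

    /* Can fail (e.g. while allocating undo space), so call it outside any
     * critical section.  The needed undo buffers come back pinned and locked. */
    urecptr = PrepareUndoInsert(&undorecord, InvalidTransactionId,
                                UNDO_PERMANENT);

    START_CRIT_SECTION();
    InsertPreparedUndo();       /* writes into the already-locked buffers */
    /* ... emit the corresponding WAL record here ... */
    END_CRIT_SECTION();

    /* Only after leaving the critical section. */
    UnlockReleaseUndoBuffers();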

4.
+/*
+ * Forget about any previously-prepared undo record.  Error recovery calls
+ * this, but it can also be used by other code that changes its mind about
+ * inserting undo after having prepared a record for insertion.
+ */
+extern void CancelPreparedUndo(void);

This API is nowhere defined or used. What is the intention?

Not required

5.
+typedef struct UndoRecordHeader
+{
..
+ /*
+ * Transaction id that has modified the tuple present in this undo record.
+ * If this is older then RecentGlobalXmin, then we can consider the tuple
+ * in this undo record as visible.
+ */
+ TransactionId urec_prevxid;
..

/then/than

Done

I think we need to mention oldestXidWithEpochHavingUndo instead of
RecentGlobalXmin.

Merged

6.
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS 0x01

I have expanded these comments in the attached delta patch.

Merged

I think we
should remove the define UREC_INFO_PAYLOAD_CONTAINS_SLOT from the
patch as this is zheap specific and should be added later along with
the zheap code.

Yeah we can, so removed in my new patch.
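
As a rough illustration of the layout rule quoted above (this is not code from
the patch; the function name is invented and plain sizeof() stands in for
whatever size macros the real code uses):

    /* Sum the sizes of the optional chunks that follow the fixed record
     * header, walking them in flag-bit (i.e. declaration) order. */
    static Size
    optional_parts_size(uint8 uur_info)
    {
        Size        size = 0;

        if (uur_info & UREC_INFO_RELATION_DETAILS)
            size += sizeof(UndoRecordRelationDetails);
        if (uur_info & UREC_INFO_BLOCK)
            size += sizeof(UndoRecordBlock);
        if (uur_info & UREC_INFO_PAYLOAD)
            size += sizeof(UndoRecordPayload);
        if (uur_info & UREC_INFO_TRANSACTION)
            size += sizeof(UndoRecordTransaction);

        return size;
    }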

7.

+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the tablespace OID and fork number.  If the tablespace OID is
+ * DEFAULTTABLESPACE_OID and the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+ ForkNumber urec_fork; /* fork number */
+} UndoRecordRelationDetails;

This comment seems to be out-dated, so modified in the attached delta
patch.

Merged

8.
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress; /* undo applying progress. */
+ uint32 urec_xidepoch; /* epoch of the current transaction */

Can you expand comments about how the progress is defined and used?

Moved your comment from UnpackedUndoRecord to this structure, and in
UnpackedUndoRecord I have mentioned that we can refer to the detailed comment
in this structure.

Also, write a few sentences about why the epoch is captured and/or used?

urec_xidepoch is captured mainly for zheap visibility purposes, so isn't
it better to mention it there?
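
For context, the usual reason for keeping the epoch next to the xid is so
that a 64-bit value can be formed which still compares correctly across xid
wraparound; a sketch, not taken from the patch:

    uint64      full_xid;

    /* GetEpochForXid() is added by this patch; the composition below is
     * only an illustration of how the epoch can be used. */
    full_xid = ((uint64) GetEpochForXid(xid) << 32) | xid;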

9.
+#define urec_next_pos \
+ (SizeOfUndoRecordTransaction - SizeOfUrecNext)

What is its purpose?

It's not required so removed

10.
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+
+/*
+ * Compute the number of bytes of storage that will be required to insert
+ * an undo record.  Sets uur->uur_info as a side effect.
+ */
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);

Again, I see duplicate text in the .h and .c files, so I removed this and
similar comments from the .h files. I have tried to move some parts of the
comments from the .h to the .c file, so that it is easier to read from one
place rather than referring to two places. See if I have missed
anything.

Apart from the above, I have made a few other cosmetic changes in the
attached delta patch; if you like those, kindly include them in
the main patch.

Done

Attachments:

0003-undo-interface-v11.patchapplication/octet-stream; name=0003-undo-interface-v11.patchDownload
From b7c43ff5467ff8f266c3be60c8cbf6bec849da75 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 3 Dec 2018 10:33:50 +0530
Subject: [PATCH] undo-interface-v11

Provide an interface for prepare, insert, or fetch the undo
records. This layer will use undo-log-storage to reserve the space for
the undo records and buffer management routine to write and read the
undo records.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
Reviewed by Amit Kapila.
---
 src/backend/access/transam/xact.c    |   28 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1192 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  461 +++++++++++++
 src/include/access/undoinsert.h      |   50 ++
 src/include/access/undorecord.h      |  187 ++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 1952 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400..6060013 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2657,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4842,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1064ee0..01815a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8323,6 +8323,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..68ed822
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1192 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's log at one-shot rather
+ * than record-by-record.  The callers are not aware of transaction header,
+ * this is entirely maintained and used by undo record layer.   See
+ * undorecord.h for detailed information about undo record header.
+ *
+ * Handling multi log:
+ *
+ * It is possible that the undo record of a transaction can be spread across
+ * multiple undo log.  And, we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
+ *
+ * If the undorecord goes to next log then we insert a transaction header for
+ * the first record in the new log and update the transaction header with this
+ * new log's location. This will allow us to connect transactions across logs
+ * when the same transaction span across log (for this we keep track of the
+ * previous logno in undo log meta) which is required to find the latest undo
+ * record pointer of the aborted transaction for executing the undo actions
+ * before discard. If the next log get processed first in that case we
+ * don't need to trace back the actual start pointer of the transaction,
+ * in such case we can only execute the undo actions from the current log
+ * because the undo pointer in the slot will be rewound and that will be enough
+ * to avoid executing same actions.  However, there is possibility that after
+ * executing the undo actions the undo pointer got discarded, now in later
+ * stage while processing the previous log it might try to fetch the undo
+ * record in the discarded log while chasing the transaction header chain.
+ * To avoid this situation we first check if the next_urec of the transaction
+ * is already discarded then no need to access that and start executing from
+ * the last undo record in the current log.
+ *
+ * We only connect to next log if the same transaction spread to next log
+ * otherwise don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support undo tuple size which is more than the BLCKSZ
+ * if not than undo record can spread across 2 buffers at the max.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record. Hence increased by 1.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + 1) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+static bool update_prev_header = false;
+
+/*
+ * By default prepared_undo and undo_buffer points to the static memory.
+ * In case caller wants to support more than default max_prepared undo records
+ * then the limit can be increased by calling UndoSetPrepareSize function.
+ * Therein, dynamic memory will be allocated and prepared_undo and undo_buffer
+ * will start pointing to newly allocated memory, which will be released by
+ * UnlockReleaseUndoBuffers and these variables will again set back to their
+ * default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while current transaction is updating its undo record pointer
+ * in previous transactions first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   bool log_switched);
+static void UndoRecordUpdateTransInfo(void);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record is discarded or not.  If it's already discarded
+ * return false otherwise return true.
+ *
+ * Caller must hold lock on log->discard_lock.  This function will release the
+ * lock if return false otherwise lock will be held on return and the caller
+ * need to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the DiscardWorker first
+		 * attempts to discard undo logs, so we cannot rely on this value to
+		 * identify whether the undo record pointer is already discarded;
+		 * instead we check it by calling the undo log routine.  If it's not
+		 * yet discarded then we have to reacquire the log->discard_lock so
+		 * that it doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, bool log_switched)
+{
+	UndoRecPtr	xact_urp;
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urecptr);
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	log = UndoLogGet(logno, false);
+
+	if (log_switched)
+	{
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+		log = UndoLogGet(log->meta.prevlogno, false);
+	}
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * We can read the previous transaction's location without locking,
+	 * because only the backend attached to the log can write to it (or we're
+	 * in recovery).
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery || log_switched);
+
+	if (log->meta.last_xact_start == 0)
+		xact_urp = InvalidUndoRecPtr;
+	else
+		xact_urp = MakeUndoRecPtr(log->logno, log->meta.last_xact_start);
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * The absence of previous transaction's undo indicate that this backend
+	 * is preparing its first undo in which case we have nothing to update.
+	 * UndoRecordIsValid will release the lock if it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read undo record header in by calling UnpackUndoRecord, if the undo
+	 * record header is split across buffers then we need to read the complete
+	 * header by invoking UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info.idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info.uur, page, starting_byte,
+							 &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info.uur.uur_next = urecptr;
+	xact_urec_info.urecptr = xact_urp;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(void)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info.urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			idx = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info.urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transactions start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info.idx_undo_buffers[idx];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info.uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		idx++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in undo buffer array, if it's present then just return
+ * its index otherwise search the buffer and insert an entry and lock the buffer
+ * in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because the
+		 * undo_buffer might hold the undo from different undo logs (e.g. when
+		 * previous transaction start header is in previous undo log) so
+		 * compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate them in bulk.
+ * This is required for operations which can allocate multiple undo records in
+ * one WAL operation, e.g. multi-insert.  If we don't allocate undo space for
+ * all the records (which are inserted under one WAL record) together, then
+ * there is a possibility that they go to different undo logs.  And, currently
+ * during recovery we don't have a mechanism to map an xid to multiple log
+ * numbers within one WAL operation.  So, in short, all the records under one
+ * WAL record must allocate their undo from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 *
+		 * XXX There is also an option that instead of adding the
+		 * information to this record we could prepare a new record which
+		 * only contains the transaction information, but we don't see any
+		 * clear advantage in that.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values as
+			 * these are used only with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	if (InRecovery)
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+	else
+		urecptr = UndoLogAllocate(size, upersistence);
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * We can consider the log as switched if this is the first record of the
+	 * log and not the first record of the transaction i.e. same transaction
+	 * continued from the previous log.
+	 */
+	if ((UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize) &&
+		log->meta.prevlogno != InvalidUndoLogNumber)
+		log_switched = true;
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space), we'll need a new transaction header.
+	 * If we weren't already generating one, then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start ||
+		 UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr, log_switched);
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+		update_prev_header = false;
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the value of how many undo records can be
+ * prepared before we can insert them.  If the size is greater than
+ * MAX_PREPARED_UNDO then it will allocate extra memory to hold the extra
+ * prepared undo.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's starting
+	 * undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intended to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because undo log only stores mapping
+		 * for the top most transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* advance the prepared ptr location for next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PreparedUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store, except
+		 * when the same transaction is continuing from the previous undo
+		 * log; read the detailed comment atop this file.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+		{
+			if (log->meta.prevlogno != InvalidUndoLogNumber)
+			{
+				UndoLogControl *prevlog = UndoLogGet(log->meta.prevlogno, false);
+
+				uur->uur_prevlen = prevlog->meta.prevlen;
+			}
+			else
+				uur->uur_prevlen = 0;
+		}
+
+		/*
+		 * if starting from a new page then consider block header size in
+		 * prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in a page.  We start writing immediately after the block header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page. If it doesn't
+			 * succeed then recall the routine with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block, account for the header
+			 * in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Link the transactions in the same log so that we can discard all
+		 * of a transaction's undo in one shot.
+		 */
+		if (UndoRecPtrIsValid(xact_urec_info.urecptr))
+			UndoRecordUpdateTransInfo();
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during transaction abort.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * to by urp and unpack the record into urec.  This function will not release
+ * the pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record that is on
+ * the same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from another
+ * block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This can be optimized to fetch just the header first, and only
+		 * fetch the complete record if the block number and offset match.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer, otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned, keep it only if the next record we
+	 * want to fetch is in the same block.  Otherwise, release the buffer and
+	 * set it invalid.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is no valid buffer in urec->uur_buffer, that means we
+		 * copied the payload data and tuple data, so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record, if a valid pointer is passed.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's condition.
+ *
+ * Returns the required undo record if found; otherwise returns NULL, which
+ * means either the record has already been discarded or there is no such
+ * record in the undo chain.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undo record satisfies the conditions. */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * This API can switch to the previous log if the current log is exhausted,
+ * so the caller shouldn't use it where that is not expected.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+	UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+	/*
+	 * We have reached the first undo record of this undo log, so fetch the
+	 * previous undo record of the transaction from the previous log.
+	 */
+	if (offset == UndoLogBlockHeaderSize)
+	{
+		UndoLogControl *prevlog,
+				   *log;
+
+		log = UndoLogGet(logno, false);
+
+		Assert(log->meta.prevlogno != InvalidUndoLogNumber);
+
+		/* Fetch the previous log control. */
+		prevlog = UndoLogGet(log->meta.prevlogno, true);
+		logno = log->meta.prevlogno;
+		offset = prevlog->meta.insert;
+	}
+
+	/* calculate the previous undo record pointer */
+	return MakeUndoRecPtr(logno, offset - prevlen);
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer, then just release the buffer;
+	 * otherwise, free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set the LSN on the undo pages.
+ */
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required at
+ * transaction abort and while releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	xact_urec_info.urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepared_undo limit was changed, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..b44c41d
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,461 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
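
A minimal caller-side sketch of the multi-call protocol described above, for
illustration only: InsertPreparedUndo() in undoinsert.c is the real caller,
and get_next_undo_page() here is a hypothetical helper standing in for the
buffer lookup that the real caller performs.

static void
insert_record_sketch(UnpackedUndoRecord *uur, int starting_byte)
{
	int			already_written = 0;

	for (;;)
	{
		/* Hypothetical helper: returns the next pinned and locked undo page. */
		Page		page = get_next_undo_page();

		/* Write as much of the record as fits on this page. */
		if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
			break;				/* the whole record has been written */

		/* The record continues on the next page, right after the block header. */
		starting_byte = UndoLogBlockHeaderSize;
	}
}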
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page, then just
+		 * point the payload data and tuple data into the page; otherwise,
+		 * allocate memory.
+		 *
+		 * XXX A possible optimization: instead of always allocating memory
+		 * whenever the tuple is split, we could check whether the payload or
+		 * tuple data falls entirely within the same page and avoid allocating
+		 * memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to be read.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is set to true, we just skip over readlen bytes in the undo
+ * record without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..2256dbe
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
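
For illustration only, a sketch of one possible callback, assuming the caller
wants the undo record that modified a particular heap item, optionally
restricted to a specific transaction (real callers, e.g. in zheap, define
their own matching rules):

static bool
sample_undo_record_matches(UnpackedUndoRecord *urec, BlockNumber blkno,
						   OffsetNumber offset, TransactionId xid)
{
	/* The record must refer to the block and offset we are chasing. */
	if (urec->uur_block != blkno || urec->uur_offset != offset)
		return false;

	/* If the caller cares about the modifying xid, it must match too. */
	if (TransactionIdIsValid(xid) && !TransactionIdEquals(xid, urec->uur_xid))
		return false;

	return true;
}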
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+extern void UnlockReleaseUndoBuffers(void);
+
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen);
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..ab4dd4c
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,187 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed together without padding bytes, and
+ * the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than oldestXidWithEpochHavingUndo, then we can consider
+	 * the tuple in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	/*
+	 * This indicates undo action apply progress: 0 means not started, 1 means
+	 * completed.  In the future, it can also be used to show how much undo
+	 * has been applied so far, using some formula.
+	 */
+	uint32		urec_progress;
+	uint32		urec_xidepoch;	/* epoch of the current transaction */
+	Oid			urec_dbid;		/* database id */
+	uint64		urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, the caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction-specific header information.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/* undo apply progress; see the detailed comment in UndoRecordTransaction */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
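
As a rough usage sketch (illustrative only; the record type code and field
choices are hypothetical), a caller would typically zero an
UnpackedUndoRecord, fill in the fields it needs, and let UndoRecordSetInfo
derive the flag bits before sizing the record with UndoRecordExpectedSize:

static Size
build_undo_record_sketch(UnpackedUndoRecord *uur, Oid reloid,
						 BlockNumber blkno, OffsetNumber offnum,
						 TransactionId xid)
{
	memset(uur, 0, sizeof(UnpackedUndoRecord));

	uur->uur_type = 1;			/* hypothetical record type code */
	uur->uur_info = 0;			/* derived below by UndoRecordSetInfo */
	uur->uur_reloid = reloid;
	uur->uur_xid = xid;
	uur->uur_fork = MAIN_FORKNUM;
	uur->uur_block = blkno;
	uur->uur_offset = offnum;
	uur->uur_next = InvalidUndoRecPtr;

	/* Set the optional-header flag bits, then compute the on-disk size. */
	UndoRecordSetInfo(uur);
	return UndoRecordExpectedSize(uur);
}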
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f3a7ba4..d4e742f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#28Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#27)
Re: Undo logs

On Wed, Dec 12, 2018 at 11:18 AM Dilip Kumar
<dilip.kumar@enterprisedb.com> wrote:

On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think I see the problem in the discard mechanism when the log is
spread across multiple logs. Basically, if the second log contains
undo of some transaction prior to the transaction which has just
decided to spread its undo in the chosen undo log, then we might
discard the undo log of some transaction(s) inadvertently. Am I
missing something?

Actually, I don't see exactly this problem here because we only process one undo log at a time, so we will not go to the next undo log and discard some transaction's undo that we are supposed to retain.

How will rollbacks work for such a case? I have more to say about
this, see below.

If not, then I guess we need to ensure that we
don't immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.

Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?

You seem to forget answering this.

One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+    TransactionId txid, UndoPersistence upersistence)

..

Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log? What is the guarantee of same?

Yeah, it's a problem; we might get an undo log which is not empty. One way to avoid this could be that, instead of relying on the check "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we add a flag to the undo log metadata indicating whether it's the first record after attach, and decide based on that. But I want to think of a better solution where we can identify this without adding anything extra to the undo metadata.

I think what we need to determine here is whether we have switched the
log for some non-first record of the transaction. If so, can't we
detect it by something like:

log = GetLatestUndoLogAmAttachedTo();
UndoLogAllocate();
if (!need_xact_hdr)
{
current_log = GetLatestUndoLogAmAttachedTo();
if (current_log is not same as log)
{
Assert(current_log->meta.prevlogno == log->logno);
log_switched = true;
}
}

Won't a similar problem happen when we read undo records during
rollback? In the code below:

+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+ /*
+ * We have reached to the first undo record of this undo log, so fetch the
+ * previous undo record of the transaction from the previous log.
+ */
+ if (offset == UndoLogBlockHeaderSize)
+ {
+ UndoLogControl *prevlog,

We seem to be assuming here that the new log starts from the
beginning. IIUC, we can read the record of some other transaction if
the transaction's log spans across two logs and the second log
contains some other transaction's log in the beginning.

8.
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress; /* undo applying progress. */
+ uint32 urec_xidepoch; /* epoch of the current transaction */

Can you expand comments about how the progress is defined and used?

I moved your comment from UnpackedUndoRecord to this structure, and in UnpackedUndoRecord I have mentioned that the detailed comment can be found in this structure.

Also, write a few sentences about why epoch is captured and or used?

urec_xidepoch is captured mainly for zheap visibility purposes, so isn't it better to mention it there?

Okay, you can leave it here as it is. One small point about this structure:

+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+ (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)

Isn't it better to define urec_next as UndoRecPtr, even though it is
technically the same as per the current code?

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

#29Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#28)
3 attachment(s)
Re: Undo logs

On Wed, Dec 12, 2018 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

On Wed, Dec 12, 2018 at 11:18 AM Dilip Kumar
<dilip.kumar@enterprisedb.com> wrote:

On Sat, Dec 8, 2018 at 7:52 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

I think I see the problem in the discard mechanism when the log is
spread across multiple logs. Basically, if the second log contains
undo of some transaction prior to the transaction which has just
decided to spread its undo in the chosen undo log, then we might
discard the undo log of some transaction(s) inadvertently. Am I
missing something?

Actually, I don't see exactly this problem here because we only process one undo log at a time, so we will not go to the next undo log and discard some transaction's undo that we are supposed to retain.

How will rollbacks work for such a case? I have more to say about
this, see below.

Yeah, I agree rollback will have this problem.

If not, then I guess we need to ensure that we
don't immediately discard the undo in the second log when a single
transaction's undo is spread across two logs.

Before choosing a new undo log to span the undo for a transaction, do
we ensure that it is not already linked with some other undo log for a
similar reason?

You seem to forget answering this.

In the current patch, I have removed the concept of prevlogno from the undo log metadata.

One more thing in this regard:
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+    TransactionId txid, UndoPersistence upersistence)

..

Isn't there a hidden assumption in the above code that you will always
get a fresh undo log if the undo doesn't fit in the currently attached
log? What is the guarantee of same?

Yeah, it's a problem; we might get an undo log which is not empty. One way to avoid this could be that, instead of relying on the check "UndoRecPtrGetOffset(urecptr) == UndoLogBlockHeaderSize", we add a flag to the undo log metadata indicating whether it's the first record after attach, and decide based on that. But I want to think of a better solution where we can identify this without adding anything extra to the undo metadata.

I think what we need to determine here is whether we have switched the
log for some non-first record of the transaction. If so, can't we
detect it by something like:

log = GetLatestUndoLogAmAttachedTo();
UndoLogAllocate();
if (!need_xact_hdr)
{
current_log = GetLatestUndoLogAmAttachedTo();
if (current_log is not same as log)
{
Assert(current_log->meta.prevlogno == log->logno);
log_switched = true;
}
}

Won't a similar problem happen when we read undo records during
rollback? In the code below:

+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen)
+{
+ UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+ UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+ /*
+ * We have reached to the first undo record of this undo log, so fetch the
+ * previous undo record of the transaction from the previous log.
+ */
+ if (offset == UndoLogBlockHeaderSize)
+ {
+ UndoLogControl *prevlog,

We seem to be assuming here that the new log starts from the
beginning. IIUC, we can read the record of some other transaction if
the transaction's log spans across two logs and the second log
contains some other transaction's log in the beginning.

To address these issues related to multiple logs, I have changed the
design as we discussed offlist; a rough sketch of the do-time handling
appears after the list below.
1) Now, at do time we identify the log switch as you mentioned above
(by checking which log we are attached to before and after allocation).
If the log has switched, we write a WAL record for it, and during
recovery, whenever this WAL record is replayed, we store the undo record
pointer of the transaction header (which is in the previous undo log)
in UndoLogStateData, read it while allocating space for the undo record,
and immediately reset it.

2) For handling the discard issue, along with updating the current
transaction's start header in the previous undo log, we also update the
previous transaction's start header in the current log if we get
assigned a non-empty undo log.

3) For identifying the previous undo record of the transaction during
rollback (when the undo log has switched), we store the undo record
pointer of the transaction's last record in the previous undo log in the
transaction header of the first undo record in the new undo log.
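
For illustration only, a rough sketch of the do-time detection in point 1,
reusing the helper name from the pseudocode upthread
(GetLatestUndoLogAmAttachedTo() is a placeholder rather than an API in the
patch, and the UndoLogAllocate() call shape is abbreviated):

	UndoLogControl *before = GetLatestUndoLogAmAttachedTo();	/* placeholder */

	urecptr = UndoLogAllocate(size, upersistence);	/* abbreviated call */

	if (!need_xact_hdr)
	{
		UndoLogControl *after = GetLatestUndoLogAmAttachedTo();

		if (after != before)
		{
			/*
			 * The same transaction continues in a new log: remember it and
			 * emit a WAL record so that recovery can restore the link back
			 * to the transaction header in the previous log.
			 */
			log_switched = true;
		}
	}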

8.
+typedef struct UndoRecordTransaction
+{
+ uint32 urec_progress; /* undo applying progress. */
+ uint32 urec_xidepoch; /* epoch of the current transaction */

Can you expand comments about how the progress is defined and used?

I moved your comment from UnpackedUndoRecord to this structure, and in UnpackedUndoRecord I have mentioned that the detailed comment can be found in this structure.

Also, write a few sentences about why epoch is captured and or used?

urec_xidepoch is captured mainly for zheap visibility purposes, so isn't it better to mention it there?

Okay, you can leave it here as it is. One small point about this structure:

+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+ (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)

Isn't it better to define urec_next as UndoRecPtr, even though it is
technically the same as per the current code?

While replying I noticed that I haven't addressed this comment; I will
handle it in the next patch. I have to change this in a couple of places.

Handling multiple logs needed some changes in the undo-log-manager patch,
so I am attaching the updated version of the undo-log patches as well.

Patches are based on commit bf491a9073e12ce1fc3e6facd0ae1308534df570.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v12.patchapplication/octet-stream; name=0003-undo-interface-v12.patchDownload
From 31830fe1fad25f0e67155bf6e68ddf3071013905 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sun, 23 Dec 2018 13:14:18 +0530
Subject: [PATCH 3/3] undo-interface-v12

Provide an interface to prepare, insert, and fetch undo records.  This
layer uses the undo-log storage layer to reserve space for the undo
records and the buffer management routines to write and read the undo
records.

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo record by Robert Haas and design inputs from Amit Kapila
Reviewed by Amit Kapila.
---
 src/backend/access/transam/xact.c    |   28 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1235 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  464 +++++++++++++
 src/include/access/undoinsert.h      |   50 ++
 src/include/access/undorecord.h      |  196 ++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2007 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index d967400..6060013 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * Set the start undo record pointer for the first undo record in a
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2657,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4842,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1064ee0..01815a6 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8323,6 +8323,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..68e6adb
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1235 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's undo in one shot rather
+ * than record by record.  The callers are not aware of the transaction
+ * header; it is entirely maintained and used by the undo record layer.  See
+ * undorecord.h for detailed information about the undo record header.
+ *
+ * Handling multiple logs:
+ *
+ * It is possible that the undo records of a transaction are spread across
+ * multiple undo logs, and we need some special handling while inserting the
+ * undo for discard and rollback to work sanely.
+ *
+ * If the undo record goes to the next log, then we insert a transaction
+ * header for the first record in the new log and update the transaction
+ * header with this new log's location.  This allows us to connect
+ * transactions across logs when the same transaction spans multiple logs
+ * (for this we keep track of the previous logno in the undo log metadata),
+ * which is required to find the latest undo record pointer of an aborted
+ * transaction in order to execute the undo actions before discard.  If the
+ * next log gets processed first, we don't need to trace back to the actual
+ * start pointer of the transaction; in that case we can execute the undo
+ * actions from the current log only, because the undo pointer in the slot
+ * will be rewound and that is enough to avoid executing the same actions
+ * again.  However, there is a possibility that after executing the undo
+ * actions the undo pointer gets discarded; at a later stage, while processing
+ * the previous log, we might then try to fetch an undo record in the
+ * discarded log while chasing the transaction header chain.  To avoid this
+ * situation, we first check whether the next_urec of the transaction has
+ * already been discarded; if so, there is no need to access it, and we start
+ * executing from the last undo record in the current log.
+ *
+ * We only connect to the next log if the same transaction spreads to the
+ * next log; otherwise we don't.
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "access/undolog_xlog.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size which is more than BLCKSZ?
+ * If not, then an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * This defines the max number of previous xact info we need to update.
+ * Usually it's 1, for updating the next link of the previous transaction's
+ * header when we are starting a new transaction.  But in some cases, where
+ * the same transaction spills to the next log, we update our own
+ * transaction's header in the previous undo log as well as the header of the
+ * previous transaction in the new log.
+ */
+#define MAX_XACT_UNDO_INFO	2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record as well.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + MAX_XACT_UNDO_INFO) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+
+/*
+ * By default, prepared_undo and undo_buffer point to static memory.  If the
+ * caller needs to prepare more than the default maximum number of undo
+ * records, the limit can be raised by calling UndoSetPrepareSize.  In that
+ * case dynamic memory is allocated and prepared_undo and undo_buffer start
+ * pointing to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, after which these variables are set back to
+ * their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating its undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info[MAX_XACT_UNDO_INFO];
+static int	xact_urec_info_idx;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   UndoRecPtr xact_urp);
+static void UndoRecordUpdateTransInfo(int idx);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record has been discarded.  Returns false if it is
+ * already discarded, otherwise true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock
+ * if it returns false; otherwise the lock is still held on return and the
+ * caller must release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on it to decide
+		 * whether the undo record pointer has already been discarded;
+		 * instead we check by calling the undo log routine.  If it is not
+		 * yet discarded, we reacquire log->discard_lock so that the undo
+		 * doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, UndoRecPtr xact_urp)
+{
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case there is nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(xact_urp))
+		return;
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(xact_urp), false);
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * If the previous transaction's undo has already been discarded, then we
+	 * have nothing to update.  UndoRecordIsValid will release the lock if it
+	 * returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the
+	 * header is split across buffers, we invoke UnpackUndoRecord multiple
+	 * times to read it completely.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info[xact_urec_info_idx].idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info[xact_urec_info_idx].uur, page,
+							 starting_byte, &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info[xact_urec_info_idx].uur.uur_next = urecptr;
+	xact_urec_info[xact_urec_info_idx].urecptr = xact_urp;
+	xact_urec_info_idx++;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just writes the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside a critical section,
+ * and it overwrites only the undo record header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info[idx].urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			i = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info[idx].urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker can't remove the record while we are in the process
+	 * of updating it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info[idx].idx_undo_buffers[i];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info[idx].uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		i++;
+
+		Assert(i < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present, just
+ * return its index; otherwise read the buffer, lock it in exclusive mode and
+ * insert an entry into the array.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number, because
+		 * undo_buffer might hold undo from different undo logs (e.g. when
+		 * the previous transaction's start header is in the previous undo
+		 * log), so compare both logno and blkno.
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert it into the
+	 * undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate the space in
+ * bulk.  This is required for operations that can allocate multiple undo
+ * records in one WAL operation, e.g. multi-insert.  If we don't allocate the
+ * undo space for all the records inserted under one WAL record together,
+ * some of them might end up in different undo logs, and currently during
+ * recovery we have no mechanism to map an xid to multiple log numbers within
+ * one WAL operation.  In short, all records under one WAL record must
+ * allocate their undo from the same undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	UndoRecPtr	prevlogurp = InvalidUndoRecPtr;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 *
+		 * XXX Instead of adding this information to the record, we could
+		 * prepare a separate record containing only the transaction
+		 * information, but we don't see any clear advantage in doing so.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			if (log_switched)
+			{
+				/*
+				 * If the undo log was switched, then during rollback we
+				 * cannot reach the transaction's previous undo record via
+				 * prevlen, so we store the previous undo record pointer in
+				 * the transaction header.
+				 */
+				log = UndoLogGet(prevlogno, false);
+				urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+												   log->meta.insert - log->meta.prevlen);
+			}
+			else
+				urec->uur_prevurp = InvalidUndoRecPtr;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values as
+			 * these are used only with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+			urec->uur_prevurp = InvalidUndoRecPtr;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	/*
+	 * Check whether the undo log got switched while we are in a transaction.
+	 */
+	if (InRecovery)
+	{
+		/*
+		 * During recovery we can identify a log switch directly by checking
+		 * prevlogurp in MyUndoLogState, which is put there during WAL
+		 * replay; we reset it immediately.
+		 */
+		prevlogurp = UndoLogStateGetAndClearPrevLogXactUrp();
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+		if (UndoRecPtrIsValid(prevlogurp))
+		{
+			prevlogno = UndoRecPtrGetLogNo(prevlogurp);
+			log_switched = true;
+		}
+	}
+	else
+	{
+		/*
+		 * Check which log we are attached to; if it changed during the
+		 * allocation then the undo log got switched.
+		 */
+		prevlogno = UndoLogAmAttachedTo(upersistence);
+		urecptr = UndoLogAllocate(size, upersistence);
+		if (!need_xact_hdr &&
+			prevlogno != InvalidUndoLogNumber &&
+			prevlogno != UndoRecPtrGetLogNo(urecptr))
+		{
+			log = UndoLogGet(prevlogno, false);
+			prevlogurp = MakeUndoRecPtr(prevlogno, log->meta.last_xact_start);
+			log_switched = true;
+		}
+	}
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space) or the undo log got switched, we'll
+	 * need a new transaction header. If we weren't already generating one,
+	 * then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start || log_switched))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/*
+		 * If the undo log is switched then we need to update our own
+		 * transaction header in the previous log as well as the previous
+		 * transaction's header in the new log.  See the detailed comments
+		 * about multi-log handling atop this file.
+		 */
+		if (log_switched)
+			UndoRecordPrepareTransInfo(urecptr, prevlogurp);
+
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr,
+									   MakeUndoRecPtr(log->logno, log->meta.last_xact_start));
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	/*
+	 * Write a WAL record for the log switch.  This is required to identify
+	 * the log switch during recovery.
+	 */
+	if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+	}
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set how many undo records can be prepared
+ * before they are inserted.  If the number is greater than
+ * MAX_PREPARED_UNDO, extra memory is allocated to hold the additional
+ * prepared undo records.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's starting
+	 * undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record
+ * you intend to insert.  Upon return, the necessary undo buffers are pinned
+ * and locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because undo log only stores mapping
+		 * for the top most transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* advance the prepared ptr location for next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+			uur->uur_prevlen = 0;
+
+		/*
+		 * If starting from a new page then include the block header size in
+		 * the prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we write the first record in a
+			 * page.  We start writing immediately after the block header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * fit completely, continue with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block then include the block
+			 * header in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+
+	/* Update previous transaction header. */
+	if (xact_urec_info_idx > 0)
+	{
+		int			i = 0;
+
+		for (i = 0; i < xact_urec_info_idx; i++)
+			UndoRecordUpdateTransInfo(i);
+	}
+
+}
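+
+/*
+ * Illustrative sketch of a typical caller of the functions above (this is
+ * not part of the API; variable names and the WAL record details are
+ * hypothetical).  It only shows the ordering implied by the comments in
+ * this file: prepare outside the critical section, insert and WAL-log
+ * inside it, and release the buffers afterwards.
+ *
+ *     UndoSetPrepareSize(recs, nrecords, InvalidTransactionId,
+ *                        UNDO_PERMANENT);
+ *     for (i = 0; i < nrecords; i++)
+ *         urecptr = PrepareUndoInsert(&recs[i], InvalidTransactionId,
+ *                                     UNDO_PERMANENT);
+ *
+ *     START_CRIT_SECTION();
+ *     InsertPreparedUndo();
+ *     XLogBeginInsert();
+ *     ... register the caller's own data and buffers ...
+ *     RegisterUndoLogBuffers(first_block_id);
+ *     recptr = XLogInsert(rmid, info);
+ *     UndoLogBuffersSetLSN(recptr);
+ *     END_CRIT_SECTION();
+ *
+ *     UnlockReleaseUndoBuffers();
+ */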
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed
+ * to by urp and unpacks it into urec.  This function does not release the
+ * pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record from the
+ * same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from
+ * another block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This can be optimized to fetch just the header first, and only
+		 * fetch the complete record if the header matches the block number
+		 * and offset.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer, otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned, keep it only if the next record we
+	 * want is in the same block; otherwise release the buffer and set it
+	 * invalid.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is no valid buffer in urec->uur_buffer, that means we
+		 * copied the payload data and tuple data, so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller must call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * If a valid pointer is passed, *urec_ptr_out is set to the undo record
+ * pointer of the qualifying undo record.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's condition.
+ *
+ * Returns the required undo record if found; otherwise returns NULL, which
+ * means either the record has already been discarded or there is no such
+ * record in the undo chain.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
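+
+/*
+ * Illustrative sketch of how a caller might walk an undo chain with
+ * UndoFetchRecord (hypothetical callback and variable names; not part of
+ * this patch).  The callback is applied to each record reached via
+ * uur_blkprev until it returns true or the chain is exhausted:
+ *
+ *     UnpackedUndoRecord *uur;
+ *     UndoRecPtr urec_ptr = latest_undo_ptr_for_tuple;
+ *
+ *     uur = UndoFetchRecord(urec_ptr, blkno, offnum, xid, &urec_ptr,
+ *                           SatisfiesMyCondition);
+ *     if (uur != NULL)
+ *     {
+ *         ... inspect uur->uur_xid, uur->uur_payload, uur->uur_tuple ...
+ *         UndoRecordRelease(uur);
+ *     }
+ */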
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * If prevurp is a valid undo record pointer then return it directly,
+ * assuming the caller has detected that the undo log was switched during
+ * the transaction and prevurp is the transaction's valid previous undo
+ * record pointer in the previous undo log.  Otherwise, calculate the
+ * previous undo record pointer from the current urp and prevlen.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp)
+{
+	if (UndoRecPtrIsValid(prevurp))
+		return prevurp;
+	else
+	{
+		UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+		UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+		/* calculate the previous undo record pointer */
+		return MakeUndoRecPtr(logno, offset - prevlen);
+	}
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set LSN on undo page.
+*/
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required on
+ * transaction abort and when releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	for (i = 0; i < xact_urec_info_idx; i++)
+		xact_urec_info[i].urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	xact_urec_info_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * The max_prepared_undo limit was changed, so free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..a13abe3
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,464 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_prevurp = uur->uur_prevurp;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_prevurp == uur->uur_prevurp);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
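+
+/*
+ * Illustrative sketch of the calling pattern described above (hypothetical
+ * caller; InsertPreparedUndo in undoinsert.c is the real user, and it also
+ * takes care of buffer locking and page initialization).  The record is
+ * written page by page until InsertUndoRecord reports completion:
+ *
+ *     already_written = 0;
+ *     starting_byte = byte offset of the record within the first page;
+ *     while (!InsertUndoRecord(uur, page, starting_byte, &already_written,
+ *                              false))
+ *     {
+ *         page = next undo page;
+ *         starting_byte = UndoLogBlockHeaderSize;
+ *     }
+ */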
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_split = (my_bytes_decoded > 0);
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode relation details (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_prevurp = work_txn.urec_prevurp;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX As an optimization, instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload
+		 * or tuple data falls entirely within one page and avoid allocating
+		 * memory for that part.
+		 */
+		if (!is_undo_split &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
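+
+/*
+ * Illustrative sketch of the calling pattern described above (hypothetical
+ * caller; UndoGetOneRecord and UndoRecordPrepareTransInfo in undoinsert.c
+ * are the real users):
+ *
+ *     already_decoded = 0;
+ *     starting_byte = byte offset of the record within the first page;
+ *     while (!UnpackUndoRecord(uur, page, starting_byte, &already_decoded,
+ *                              false))
+ *     {
+ *         page = next undo page;
+ *         starting_byte = UndoLogBlockHeaderSize;
+ *     }
+ */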
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true then the readlen bytes are skipped over rather than
+ * copied into the destination buffer.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..41384e1
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+extern void UnlockReleaseUndoBuffers(void);
+
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp);
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..9f09055
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,196 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed into the alignment without padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than oldestXidWithEpochHavingUndo, then we can
+	 * consider the tuple in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the order in which the structures are defined below.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
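+
+/*
+ * Putting it together, the on-disk layout of a complete undo record, as
+ * written by InsertUndoRecord, is the following (optional pieces are
+ * present only when the corresponding UREC_INFO_* bit is set):
+ *
+ *     UndoRecordHeader
+ *     [UndoRecordRelationDetails]
+ *     [UndoRecordBlock]
+ *     [UndoRecordTransaction]
+ *     [UndoRecordPayload]
+ *     [payload bytes]
+ *     [tuple bytes]
+ */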
+
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	uint64		urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	/*
+	 * This indicates undo action apply progress, 0 means not started, 1 means
+	 * completed.  In future, it can also be used to show the progress of how
+	 * much undo has been applied so far with some formula.
+	 */
+	uint32		urec_progress;
+	uint32		urec_xidepoch;	/* epoch of the current transaction */
+	Oid			urec_dbid;		/* database id */
+
+	/*
+	 * The transaction's previous undo record pointer when the transaction is
+	 * split across undo logs.  The first undo record in the new log stores
+	 * the previous undo record pointer in the previous log, as we cannot
+	 * calculate it directly using prevlen during rollback.
+	 */
+	uint64		urec_prevurp;
+	uint64		urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, the caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction-specific header information.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	uint64		uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	uint64		uur_prevurp;
+	uint64		uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/* undo applying progress, see detail comment in UndoRecordTransaction */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
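/*
 * Editorial sketch (not part of the patch): the calling pattern implied by
 * the comments above for the UnpackedUndoRecord intermediate form.  The
 * function name and all field values are placeholders.
 */
void
unpacked_undo_record_usage_sketch(void)
{
	UnpackedUndoRecord uur;

	memset(&uur, 0, sizeof(uur));
	uur.uur_info = 0;			/* filled in by UndoRecordSetInfo() later */
	uur.uur_reloid = 16384;		/* placeholder relation OID */
	uur.uur_fork = MAIN_FORKNUM;
	uur.uur_block = 0;
	uur.uur_offset = 1;
	initStringInfo(&uur.uur_payload);
	initStringInfo(&uur.uur_tuple);

	/* How much undo log space would this record need? */
	(void) UndoRecordExpectedSize(&uur);
}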
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 689c57c..73394c5 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f3a7ba4..d4e742f 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

0001-Add-undo-log-manager_v4.patchapplication/octet-stream; name=0001-Add-undo-log-manager_v4.patchDownload
From b09b91c8d70317585094b309b3629825d97c8b8c Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sun, 23 Dec 2018 15:12:05 +0530
Subject: [PATCH 1/3] Add undo log manager.

Add a new subsystem to manage undo logs.  Undo logs allow data to be appended
efficiently, like logs.  They also allow data to be discarded efficiently from
the other end, like a queue.  Thirdly, they allow efficient buffered random
access, like a relation.

Undo logs physically consist of a set of 1MB segment files under
$PGDATA/base/undo (or per-tablespace equivalent) that are created, deleted or
renamed as required, similarly to the way that WAL segments are managed.
Meta-data about the set of undo logs is stored in shared memory, and written
to per-checkpoint files under $PGDATA/pg_undo.

This commit provides an API for allocating and discarding undo log storage
space and managing the files in a crash-safe way.  A later commit will provide
support for accessing the data stored inside them.

XXX Status: WIP.  Some details around WAL are being reconsidered, as noted in
comments.

Author: Thomas Munro, with contributions from Dilip Kumar and input from
        Amit Kapila and Robert Haas
Tested-By: Neha Sharma
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/Makefile               |    2 +-
 src/backend/access/rmgrdesc/Makefile      |    2 +-
 src/backend/access/rmgrdesc/undologdesc.c |   97 ++
 src/backend/access/transam/rmgr.c         |    1 +
 src/backend/access/transam/xlog.c         |   17 +
 src/backend/access/undo/Makefile          |   17 +
 src/backend/access/undo/undolog.c         | 2682 +++++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql      |    4 +
 src/backend/commands/tablespace.c         |   23 +
 src/backend/replication/logical/decode.c  |    1 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/lmgr/lwlock.c         |    2 +
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/utils/init/postinit.c         |    1 +
 src/backend/utils/misc/guc.c              |   12 +
 src/bin/initdb/initdb.c                   |    2 +
 src/bin/pg_waldump/rmgrdesc.c             |    1 +
 src/include/access/rmgrlist.h             |    1 +
 src/include/access/undolog.h              |  398 +++++
 src/include/access/undolog_xlog.h         |   73 +
 src/include/catalog/pg_proc.dat           |    7 +
 src/include/storage/lwlock.h              |    2 +
 src/include/utils/guc.h                   |    2 +
 src/test/regress/expected/rules.out       |   11 +
 24 files changed, 3360 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
 create mode 100644 src/backend/access/undo/Makefile
 create mode 100644 src/backend/access/undo/undolog.c
 create mode 100644 src/include/access/undolog.h
 create mode 100644 src/include/access/undolog_xlog.h

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index bd93a6a..7f7380c 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  tablesample transam
+			  tablesample transam undo
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..91ad1ef 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,6 @@ include $(top_builddir)/src/Makefile.global
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
 	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o undologdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 0000000..1053dc7
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ *	  rmgr descriptor routines for access/undo/undolog.c
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_UNDOLOG_CREATE)
+	{
+		xl_undolog_create *xlrec = (xl_undolog_create *) rec;
+
+		appendStringInfo(buf, "logno %u", xlrec->logno);
+	}
+	else if (info == XLOG_UNDOLOG_EXTEND)
+	{
+		xl_undolog_extend *xlrec = (xl_undolog_extend *) rec;
+
+		appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_ATTACH)
+	{
+		xl_undolog_attach *xlrec = (xl_undolog_attach *) rec;
+
+		appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid);
+	}
+	else if (info == XLOG_UNDOLOG_DISCARD)
+	{
+		xl_undolog_discard *xlrec = (xl_undolog_discard *) rec;
+
+		appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end "
+						 UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->discard, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_REWIND)
+	{
+		xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec;
+
+		appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d",
+						 xlrec->logno, xlrec->insert, xlrec->prevlen);
+	}
+	else if (info == XLOG_UNDOLOG_SWITCH)
+	{
+		UndoRecPtr prevlogurp = *(UndoRecPtr *) rec;
+
+		appendStringInfo(buf, "previous log urp " UndoRecPtrFormat, prevlogurp);
+	}
+
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			id = "CREATE";
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			id = "EXTEND";
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			id = "ATTACH";
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			id = "DISCARD";
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			id = "REWIND";
+			break;
+		case XLOG_UNDOLOG_SWITCH:
+			id = "SWITCH";
+			break;
+	}
+
+	return id;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b05374 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -18,6 +18,7 @@
 #include "access/multixact.h"
 #include "access/nbtxlog.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14e..1064ee0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -6693,6 +6694,9 @@ StartupXLOG(void)
 	 */
 	restoreTwoPhaseData();
 
+	/* Recover undo log meta data corresponding to this checkpoint. */
+	StartupUndoLogs(ControlFile->checkPointCopy.redo);
+
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
@@ -7315,7 +7319,13 @@ StartupXLOG(void)
 	 * end-of-recovery steps fail.
 	 */
 	if (InRecovery)
+	{
 		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+		ResetUndoLogs(UNDO_UNLOGGED);
+	}
+
+	/* Always reset temporary undo logs. */
+	ResetUndoLogs(UNDO_TEMP);
 
 	/*
 	 * We don't need the latch anymore. It's not strictly necessary to disown
@@ -9020,6 +9030,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
+	CheckPointUndoLogs(checkPointRedo, ControlFile->checkPointCopy.redo);
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
@@ -9726,6 +9737,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/*
 		 * We should've already switched to the new TLI before replaying this
 		 * record.
@@ -9785,6 +9799,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/* TLI should not change in an on-line checkpoint */
 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
 			ereport(PANIC,
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
new file mode 100644
index 0000000..219c696
--- /dev/null
+++ b/src/backend/access/undo/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/undo
+#
+# IDENTIFICATION
+#    src/backend/access/undo/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/undo
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = undolog.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
new file mode 100644
index 0000000..42a9590
--- /dev/null
+++ b/src/backend/access/undo/undolog.c
@@ -0,0 +1,2682 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ *	  management of undo logs
+ *
+ * PostgreSQL undo log manager.  This module is responsible for managing the
+ * lifecycle of undo logs and their segment files, associating undo logs with
+ * backends, and allocating space within undo logs.
+ *
+ * For the code that reads and writes blocks of data, see undofile.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogreader.h"
+#include "catalog/catalog.h"
+#include "catalog/pg_tablespace.h"
+#include "commands/tablespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "pgstat.h"
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/standby.h"
+#include "storage/undofile.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/varlena.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+/*
+ * During recovery we maintain a mapping of transaction ID to undo log
+ * numbers.  We do this with a two-level array, so that we use memory only for
+ * chunks of the array that overlap with the range of active xids.
+ */
+#define UndoLogXidLowBits 16
+
+/*
+ * Number of high bits.
+ */
+#define UndoLogXidHighBits \
+	(sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits)
+
+/* Extract the upper bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits)
+
+/* Extract the lower bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1))
+
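/*
 * Editorial sketch (not part of the patch): how the two-level xid map is
 * consulted, as done later in IsTransactionFirstRec() and
 * UndoLogAllocateInRecovery().  The function name is invented; it references
 * the MyUndoLogState session state declared further down in this file.
 */
static UndoLogNumber
xid_map_lookup_sketch(TransactionId xid)
{
	uint16		high_bits = UndoLogGetXidHigh(xid);
	uint16		low_bits = UndoLogGetXidLow(xid);

	if (MyUndoLogState.xid_map == NULL ||
		MyUndoLogState.xid_map[high_bits] == NULL)
		return InvalidUndoLogNumber;

	return MyUndoLogState.xid_map[high_bits][low_bits];
}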
+/*
+ * Main control structure for undo log management in shared memory.
+ * UndoLogControl objects are arranged in a fixed-size array, at a position
+ * determined by the undo log number.
+ */
+typedef struct UndoLogSharedData
+{
+	UndoLogNumber free_lists[UndoPersistenceLevels];
+	UndoLogNumber low_logno; /* the lowest logno */
+	UndoLogNumber next_logno; /* one past the highest logno */
+	UndoLogNumber array_size; /* how many UndoLogControl objects do we have? */
+	UndoLogControl logs[FLEXIBLE_ARRAY_MEMBER];
+} UndoLogSharedData;
+
+/*
+ * Per-backend state for the undo log module.
+ * Backend-local pointers to undo subsystem state in shared memory.
+ */
+typedef struct UndoLogSession
+{
+	UndoLogSharedData *shared;
+
+	/*
+	 * The control object for the undo logs that this session is currently
+	 * attached to at each persistence level.  This is where it will write new
+	 * undo data.
+	 */
+	UndoLogControl *logs[UndoPersistenceLevels];
+
+	/*
+	 * If the undo_tablespaces GUC changes we'll remember to examine it and
+	 * attach to a new undo log using this flag.
+	 */
+	bool			need_to_choose_tablespace;
+
+	/*
+	 * During recovery, the startup process maintains a mapping of xid to undo
+	 * log number, instead of using 'log' above.  This is not used in regular
+	 * backends and can be in backend-private memory so long as recovery is
+	 * single-process.  This map references UNDO_PERMANENT logs only, since
+	 * temporary and unlogged relations don't have WAL to replay.
+	 */
+	UndoLogNumber **xid_map;
+
+	/*
+	 * The slot for the oldest xids still running.  We advance this during
+	 * checkpoints to free up chunks of the map.
+	 */
+	uint16			xid_map_oldest_chunk;
+
+	/* Current dbid.  Used during recovery. */
+	Oid				dbid;
+
+	/*
+	 * Pointer to the transaction's start header undo record in the previous
+	 * undo log, for when a transaction spills across multiple undo logs.  This
+	 * is used for identifying the log switch during recovery and updating
+	 * the transaction header in the previous log.
+	 */
+	UndoRecPtr	prevlogurp;	
+} UndoLogSession;
+
+UndoLogSession MyUndoLogState;
+
+undologtable_hash *undologtable_cache;
+
+/* GUC variables */
+char	   *undo_tablespaces = NULL;
+
+static UndoLogControl *get_undo_log(UndoLogNumber logno, bool locked);
+static UndoLogControl *allocate_undo_log(void);
+static void free_undo_log(UndoLogControl *log);
+static void attach_undo_log(UndoPersistence level, Oid tablespace);
+static void detach_current_undo_log(UndoPersistence level, bool full);
+static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end);
+static void undo_log_before_exit(int code, Datum value);
+static void forget_undo_buffers(int logno, UndoLogOffset old_discard,
+								UndoLogOffset new_discard,
+								bool drop_tail);
+static bool choose_undo_tablespace(bool force_detach, Oid *oid);
+static void undolog_xid_map_gc(void);
+
+PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs);
+
+/*
+ * How many undo logs can be active at a time?  This creates a theoretical
+ * maximum transaction size, but if we set it to a multiple of the maximum number
+ * of backends it will be a very high limit.  Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+	return MaxBackends * 4;
+}
+
+/*
+ * Return the amount of traditional shmem required for undo log management.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)
+{
+	return sizeof(UndoLogSharedData) +
+		UndoLogNumSlots() * sizeof(UndoLogControl);
+}
+
+/*
+ * Initialize the undo log subsystem.  Called in each backend.
+ */
+void
+UndoLogShmemInit(void)
+{
+	bool found;
+
+	MyUndoLogState.shared = (UndoLogSharedData *)
+		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
+
+	/* The postmaster initialized the shared memory state. */
+	if (!IsUnderPostmaster)
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		Assert(!found);
+
+		/*
+		 * We start with no active undo logs.  StartupUndoLogs() will recreate
+		 * the undo logs that were known at the last checkpoint.
+		 */
+		memset(shared, 0, sizeof(*shared));
+		shared->array_size = UndoLogNumSlots();
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+			shared->free_lists[i] = InvalidUndoLogNumber;
+		for (i = 0; i < shared->array_size; ++i)
+		{
+			memset(&shared->logs[i], 0, sizeof(shared->logs[i]));
+			shared->logs[i].logno = InvalidUndoLogNumber;
+			LWLockInitialize(&shared->logs[i].mutex,
+							 LWTRANCHE_UNDOLOG);
+			LWLockInitialize(&shared->logs[i].discard_lock,
+							 LWTRANCHE_UNDODISCARD);
+		}
+	}
+	else
+		Assert(found);
+
+	/* All backends prepare their per-backend lookup table. */
+	undologtable_cache = undologtable_create(TopMemoryContext,
+											 UndoLogNumSlots(),
+											 NULL);
+}
+
+void
+UndoLogInit(void)
+{
+	before_shmem_exit(undo_log_before_exit, 0);
+}
+
+/*
+ * Figure out which directory holds an undo log based on tablespace.
+ */
+static void
+UndoLogDirectory(Oid tablespace, char *dir)
+{
+	if (tablespace == DEFAULTTABLESPACE_OID ||
+		tablespace == InvalidOid)
+		snprintf(dir, MAXPGPATH, "base/undo");
+	else
+		snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo",
+				 tablespace, TABLESPACE_VERSION_DIRECTORY);
+}
+
+/*
+ * Compute the pathname to use for an undo log segment file.
+ */
+void
+UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path)
+{
+	char		dir[MAXPGPATH];
+
+	/* Figure out which directory holds the segment, based on tablespace. */
+	UndoLogDirectory(tablespace, dir);
+
+	/*
+	 * Build the path from log number and offset.  The pathname is the
+	 * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+	 * period inserted between the components.
+	 */
+	snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+			 segno * UndoLogSegmentSize);
+}
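/*
 * Editorial example (not part of the patch): with the 1MB segment size
 * described in the commit message, a call like the following would produce
 * the path "base/undo/000005.0000300000" for segment 3 of undo log 5 in the
 * default tablespace.
 *
 *     char path[MAXPGPATH];
 *
 *     UndoLogSegmentPath(5, 3, DEFAULTTABLESPACE_OID, path);
 */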
+
+/*
+ * Iterate through the set of currently active logs.  Pass in NULL to get the
+ * first undo log.  NULL indicates the end of the set of logs.  The caller
+ * must lock the returned log before accessing its members, and must skip if
+ * logno is not valid.
+ */
+UndoLogControl *
+UndoLogNext(UndoLogControl *log)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+	for (;;)
+	{
+		/* Advance to the next log. */
+		if (log == NULL)
+		{
+			/* Start at the beginning. */
+			log = &shared->logs[0];
+		}
+		else if (++log == &shared->logs[shared->array_size])
+		{
+			/* Past the end. */
+			log = NULL;
+			break;
+		}
+		/* Have we found a slot with a valid log? */
+		if (log->logno != InvalidUndoLogNumber)
+			break;
+	}
+	LWLockRelease(UndoLogLock);
+
+	/* XXX: erm, which lock should the caller hold!? */
+	return log;
+}
+
+/*
+ * Check if an undo log position has been discarded.  'point' must be an undo
+ * log pointer that was allocated at some point in the past, otherwise the
+ * result is undefined.
+ */
+bool
+UndoLogIsDiscarded(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log;
+	bool	result;
+
+	log = get_undo_log(logno, false);
+
+	/*
+	 * If we couldn't find the undo log number, then it must be entirely
+	 * discarded.
+	 */
+	if (log == NULL)
+		return true;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (unlikely(logno != log->logno))
+	{
+		/*
+		 * The undo log has been entirely discarded since we looked it up, and
+		 * the UndoLogControl slot is now unused or being used for some other
+		 * undo log.  That means that any pointer within it must be discarded.
+		 */
+		result = true;
+	}
+	else
+	{
+		/* Check if this point is before the discard pointer. */
+		result = UndoRecPtrGetOffset(point) < log->meta.discard;
+	}
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Store the latest transaction's start undo record pointer in the undo meta
+ * data.  It will be fetched by the backend when it's reusing the undo log and
+ * preparing its first undo.
+ */
+void
+UndoLogSetLastXactStartPoint(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO: review */
+	log->meta.last_xact_start = UndoRecPtrGetOffset(point);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Fetch the previous transaction's start undo record point.
+ */
+UndoRecPtr
+UndoLogGetLastXactStartPoint(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint64 last_xact_start = 0;
+
+	if (unlikely(log == NULL))
+		return InvalidUndoRecPtr;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO: review */
+	last_xact_start = log->meta.last_xact_start;
+	LWLockRelease(&log->mutex);
+
+	if (last_xact_start == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, last_xact_start);
+}
+
+/*
+ * Store the last undo record's length in undo meta-data so that it can be
+ * persistent across restart.
+ */
+void
+UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO review */
+	log->meta.prevlen = prevlen;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get the last undo record's length.
+ */
+uint16
+UndoLogGetPrevLen(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint16	prevlen;
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO review */
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	return prevlen;
+}
+
+/*
+ * Check whether this record is the first record of its transaction.
+ */
+bool
+IsTransactionFirstRec(TransactionId xid)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	Assert(InRecovery);
+
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/* TODO review */
+	return log->meta.is_first_rec;
+}
+
+/*
+ * Detach from the undo log we are currently attached to, returning it to the
+ * appropriate free list if it still has space.
+ */
+static void
+detach_current_undo_log(UndoPersistence persistence, bool full)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+
+	MyUndoLogState.logs[persistence] = NULL;
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = InvalidPid;
+	log->xid = InvalidTransactionId;
+	if (full)
+		log->meta.status = UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	/* Push back onto the appropriate free list, unless it's full. */
+	if (!full)
+	{
+		LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+		log->next_free = shared->free_lists[persistence];
+		shared->free_lists[persistence] = log->logno;
+		LWLockRelease(UndoLogLock);
+	}
+}
+
+/*
+ * Exit handler, detaching from all undo logs.
+ */
+static void
+undo_log_before_exit(int code, Datum arg)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		if (MyUndoLogState.logs[i] != NULL)
+			detach_current_undo_log(i, false);
+	}
+}
+
+/*
+ * Create a new empty segment file on disk for the segment starting at byte 'end'.
+ */
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+							UndoLogOffset end)
+{
+	struct stat	stat_buffer;
+	off_t	size;
+	char	path[MAXPGPATH];
+	void   *zeroes;
+	size_t	nzeroes = 8192;
+	int		fd;
+
+	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
+
+	/*
+	 * Create and fully allocate a new file.  If we crashed and recovered
+	 * then the file might already exist, so use flags that tolerate that.
+	 * It's also possible that it exists but is too short, in which case
+	 * we'll write the rest.  We don't really care what's in the file, we
+	 * just want to make sure that the filesystem has allocated physical
+	 * blocks for it, so that non-COW filesystems will report ENOSPC now
+	 * rather than later when the space is needed and we'll avoid creating
+	 * files with holes.
+	 */
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0 && tablespace != 0)
+	{
+		char undo_path[MAXPGPATH];
+
+		/* Try creating the undo directory for this tablespace. */
+		UndoLogDirectory(tablespace, undo_path);
+		if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+		{
+			char	   *parentdir;
+
+			if (errno != ENOENT || !InRecovery)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+
+			/*
+			 * In recovery, it's possible that the tablespace directory
+			 * doesn't exist because a later WAL record removed the whole
+			 * tablespace.  In that case we create a regular directory to
+			 * stand in for it.  This is similar to the logic in
+			 * TablespaceCreateDbspace().
+			 */
+
+			/* create two parents up if not exist */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			/* create one parent up if not exist */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+		}
+
+		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	}
+	if (fd < 0)
+		elog(ERROR, "could not create new file \"%s\": %m", path);
+	if (fstat(fd, &stat_buffer) < 0)
+		elog(ERROR, "could not stat \"%s\": %m", path);
+	size = stat_buffer.st_size;
+
+	/* A buffer full of zeroes we'll use to fill up new segment files. */
+	zeroes = palloc0(nzeroes);
+
+	while (size < UndoLogSegmentSize)
+	{
+		ssize_t written;
+
+		written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size));
+		if (written < 0)
+			elog(ERROR, "cannot initialize undo log segment file \"%s\": %m",
+				 path);
+		size += written;
+	}
+
+	/* Flush the contents of the file to disk. */
+	if (pg_fsync(fd) != 0)
+		elog(ERROR, "cannot fsync file \"%s\": %m", path);
+	CloseTransientFile(fd);
+
+	pfree(zeroes);
+
+	elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */
+}
+
+/*
+ * Create a new undo segment when it is unexpectedly not present.
+ */
+void
+UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno)
+{
+	Assert(InRecovery);
+	allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize);
+}
+
+/*
+ * Create and zero-fill a new segment for a given undo log number.
+ */
+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+	UndoLogControl *log;
+	char		dir[MAXPGPATH];
+	size_t		end;
+
+	log = get_undo_log(logno, false);
+
+	/* TODO review interlocking */
+
+	Assert(log != NULL);
+	Assert(log->meta.end % UndoLogSegmentSize == 0);
+	Assert(new_end % UndoLogSegmentSize == 0);
+	Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery);
+
+	/*
+	 * Create all the segments needed to increase 'end' to the requested
+	 * size.  This is quite expensive, so we will try to avoid it completely
+	 * by renaming files into place in UndoLogDiscard instead.
+	 */
+	end = log->meta.end;
+	while (end < new_end)
+	{
+		allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	Assert(end == new_end);
+
+	/*
+	 * Flush the parent dir so that the directory metadata survives a crash
+	 * after this point.
+	 */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/*
+	 * If we're not in recovery, we need to WAL-log the creation of the new
+	 * file(s).  We do that after the above filesystem modifications, in
+	 * violation of the data-before-WAL rule as exempted by
+	 * src/backend/access/transam/README.  This means that it's possible for
+	 * us to crash having made some or all of the filesystem changes but
+	 * before WAL logging, but in that case we'll eventually try to create the
+	 * same segment(s) again, which is tolerated.
+	 */
+	if (!InRecovery)
+	{
+		xl_undolog_extend xlrec;
+		XLogRecPtr	ptr;
+
+		xlrec.logno = logno;
+		xlrec.end = end;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+		XLogFlush(ptr);
+	}
+
+	/*
+	 * We didn't need to acquire the mutex to read 'end' above because only
+	 * we write to it.  But we need the mutex to update it, because the
+	 * checkpointer might read it concurrently.
+	 *
+	 * XXX It's possible for meta.end to be higher already during
+	 * recovery, because of the timing of a checkpoint; in that case we did
+	 * nothing above and we shouldn't update shmem here.  That interaction
+	 * needs more analysis.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (log->meta.end < end)
+		log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get an insertion point that is guaranteed to be backed by enough space to
+ * hold 'size' bytes of data.  To actually write into the undo log, client
+ * code should call this first and then use bufmgr routines to access buffers
+ * and provide WAL logs and redo handlers.  In other words, while this module
+ * looks after making sure the undo log has sufficient space and the undo meta
+ * data is crash safe, the *contents* of the undo log and (indirectly) the
+ * insertion point are the responsibility of client code.
+ *
+ * Return an undo log insertion point that can be converted to a buffer tag
+ * and an insertion point within a buffer page.
+ *
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+	UndoLogOffset new_insert;
+	TransactionId logxid;
+
+	/*
+	 * We may need to attach to an undo log, either because this is the first
+	 * time this backend has needed to write to an undo log at all or because
+	 * the undo_tablespaces GUC was changed.  When doing that, we'll need
+	 * interlocking against tablespaces being concurrently dropped.
+	 */
+
+ retry:
+	/* See if we need to check the undo_tablespaces GUC. */
+	if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL))
+	{
+		Oid		tablespace;
+		bool	need_to_unlock;
+
+		need_to_unlock =
+			choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace,
+								   &tablespace);
+		attach_undo_log(persistence, tablespace);
+		if (need_to_unlock)
+			LWLockRelease(TablespaceCreateLock);
+		log = MyUndoLogState.logs[persistence];
+		MyUndoLogState.need_to_choose_tablespace = false;
+	}
+
+	/*
+	 * If this is the first time we've allocated undo log space in this
+	 * transaction, we'll record the xid->undo log association so that it can
+	 * be replayed correctly. Before that, we set the first record flag to
+	 * false.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.is_first_rec = false;
+	logxid = log->xid;
+
+	if (logxid != GetTopTransactionId())
+	{
+		xl_undolog_attach xlrec;
+
+		/*
+		 * While we have the lock, check if we have been forcibly detached by
+		 * DROP TABLESPACE.  That can only happen between transactions (see
+		 * DropUndoLogsInsTablespace()).
+		 */
+		if (log->pid == InvalidPid)
+		{
+			LWLockRelease(&log->mutex);
+			log = NULL;
+			goto retry;
+		}
+		log->xid = GetTopTransactionId();
+		log->meta.is_first_rec = true;
+		LWLockRelease(&log->mutex);
+
+		/* Skip the attach record for unlogged and temporary tables. */
+		if (persistence == UNDO_PERMANENT)
+		{
+			xlrec.xid = GetTopTransactionId();
+			xlrec.logno = log->logno;
+			xlrec.dbid = MyDatabaseId;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH);
+		}
+	}
+	else
+	{
+		LWLockRelease(&log->mutex);
+	}
+
+	/*
+	 * 'size' is expressed in usable non-header bytes.  Figure out how far we
+	 * have to move insert to create space for 'size' usable bytes, stepping
+	 * over any intervening headers.
+	 */
+	Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize);
+	new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize);
+
+	/*
+	 * We don't need to acquire log->mutex to read log->meta.insert and
+	 * log->meta.end, because this backend is the only one that can
+	 * modify them.
+	 */
+	if (unlikely(new_insert > log->meta.end))
+	{
+		if (new_insert > UndoLogMaxSize)
+		{
+			/* This undo log is entirely full.  Get a new one. */
+			elog(LOG, "undo log %u is full, switching to a new one", log->logno);
+			log = NULL;
+			detach_current_undo_log(persistence, true);
+			goto retry;
+		}
+		/*
+		 * Extend the end of this undo log to cover new_insert (in other words
+		 * round up to the segment size).
+		 */
+		extend_undo_log(log->logno,
+						new_insert + UndoLogSegmentSize -
+						new_insert % UndoLogSegmentSize);
+		Assert(new_insert <= log->meta.end);
+	}
+
+	return MakeUndoRecPtr(log->logno, log->meta.insert);
+}
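/*
 * Editorial sketch (not part of the patch): the calling pattern implied by
 * the comments above.  The caller reserves space, writes the data through
 * bufmgr routines (with its own WAL logging), and then advances the insert
 * pointer.  'rec_size' and the elided buffer access details are placeholders.
 */
void
undo_allocate_usage_sketch(size_t rec_size)
{
	UndoRecPtr	urp;

	urp = UndoLogAllocate(rec_size, UNDO_PERMANENT);

	/*
	 * ... convert urp to a buffer tag, pin/lock the buffer(s), copy the
	 * record bytes in, and emit WAL describing the change ...
	 */

	UndoLogAdvance(urp, rec_size, UNDO_PERMANENT);
}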
+
+/*
+ * In recovery, we expect the xid to map to a known log which already has
+ * enough space in it.
+ */
+UndoRecPtr
+UndoLogAllocateInRecovery(TransactionId xid, size_t size,
+						  UndoPersistence level)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	/*
+	 * The sequence of calls to UndoLogAllocateInRecovery() during REDO
+	 * (recovery) must match the sequence of calls to UndoLogAllocate during
+	 * DO, for any given session.  The XXX_redo code for any UNDO-generating
+	 * operation must use UndoLogAllocateInRecovery() rather than
+	 * UndoLogAllocate(), because it must supply the extra 'xid' argument so
+	 * that we can find out which undo log number to use.  During DO, that's
+	 * tracked per-backend, but during REDO the original backends/sessions are
+	 * lost and we have only the Xids.
+	 */
+	Assert(InRecovery);
+
+	/*
+	 * Look up the undo log number for this xid.  The mapping must already
+	 * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the
+	 * first call to UndoLogAllocate for this xid after the most recent
+	 * checkpoint.
+	 */
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	if (logno == InvalidUndoLogNumber)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	/*
+	 * This log must already have been created by an XLOG_UNDOLOG_CREATE
+	 * record emitted by UndoLogAllocate().
+	 */
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/*
+	 * This log must already have been extended to cover the requested size by
+	 * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate(), or by
+	 * XLOG_UNDOLOG_DISCARD records recycling segments.
+	 */
+	if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size))
+		elog(ERROR,
+			 "unexpectedly couldn't allocate %zu bytes in undo log number %d",
+			 size, logno);
+
+	/*
+	 * By this time we have allocated an undo log in this transaction, so any
+	 * subsequent record will not be the first undo record for the transaction.
+	 */
+	log->meta.is_first_rec = false;
+
+	return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = NULL;
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insertion_point);
+
+	/*
+	 * During recovery, MyUndoLogState is uninitialized, so we have to look the
+	 * log up by number instead.
+	 */
+	log = (InRecovery) ? get_undo_log(logno, false)
+		: MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+	Assert(InRecovery || logno == log->logno);
+	Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions.  The passed in undo pointer is
+ * the address of the oldest data that the caller would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_pointer).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment.  This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk.  Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted.  It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+	UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+	UndoLogOffset old_discard;
+	UndoLogOffset end;
+	UndoLogControl *log;
+	int			segno;
+	int			new_segno;
+	bool		need_to_flush_wal = false;
+	bool		entirely_discarded = false;
+
+	log = get_undo_log(logno, false);
+	if (unlikely(log == NULL))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is already entirely discarded",
+			 logno);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (unlikely(log->logno != logno))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is entirely discarded",
+			 logno);
+	if (discard > log->meta.insert)
+		elog(ERROR, "cannot move discard point past insert point");
+	old_discard = log->meta.discard;
+	if (discard < old_discard)
+		elog(ERROR, "cannot move discard pointer backwards");
+	end = log->meta.end;
+	/* Are we discarding the last remaining data in a log marked as full? */
+	if (log->meta.status == UNDO_LOG_STATUS_FULL &&
+		discard == log->meta.insert)
+	{
+		/*
+		 * Adjust the discard and insert pointers so that the final segment is
+		 * deleted from disk, and remember not to recycle it.
+		 */
+		entirely_discarded = true;
+		log->meta.insert = log->meta.end;
+		discard = log->meta.end;
+	}
+	LWLockRelease(&log->mutex);
+
+	/*
+	 * Drop all buffers holding this undo data out of the buffer pool (except
+	 * the last one, if the new location is in the middle of it somewhere), so
+	 * that the contained data doesn't ever touch the disk.  The caller
+	 * promises that this data will not be needed again.  We have to drop the
+	 * buffers from the buffer pool before removing files, otherwise a
+	 * concurrent session might try to write the block to evict the buffer.
+	 */
+	forget_undo_buffers(logno, old_discard, discard, entirely_discarded);
+
+	/*
+	 * Check if we crossed a segment boundary and need to do some synchronous
+	 * filesystem operations.
+	 */
+	segno = old_discard / UndoLogSegmentSize;
+	new_segno = discard / UndoLogSegmentSize;
+	if (segno < new_segno)
+	{
+		int		recycle;
+		UndoLogOffset pointer;
+
+		/*
+		 * We always WAL-log discards, but we only need to flush the WAL if we
+		 * have performed a filesystem operation.
+		 */
+		need_to_flush_wal = true;
+
+		/*
+		 * XXX When we rename or unlink a file, it's possible that some
+		 * backend still has it open because it has recently read a page from
+		 * it.  smgr/undofile.c in any such backend will eventually close it,
+		 * because it considers that fd to belong to the file with the name
+		 * that we're unlinking or renaming and it doesn't like to keep more
+		 * than one open at a time.  No backend should ever try to read from
+		 * such a file descriptor; that is what it means when we say that the
+		 * caller of UndoLogDiscard() asserts that there will be no attempts
+		 * to access the discarded range of undo log.  In the case of a
+		 * rename, if a backend were to attempt to read undo data in the range
+		 * being discarded, it would read entirely the wrong data.
+		 */
+
+		/*
+		 * How many segments should we recycle (= rename from tail position to
+		 * head position)?  For now it's always 1 unless there is already a
+		 * spare one, but we could have an adaptive algorithm that recycles
+		 * multiple segments at a time and pays just one fsync().
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+		if ((log->meta.end - log->meta.insert) < UndoLogSegmentSize &&
+			log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+			recycle = 1;
+		else
+			recycle = 0;
+		LWLockRelease(&log->mutex);
+
+		/* Rewind to the start of the segment. */
+		pointer = segno * UndoLogSegmentSize;
+
+		while (pointer < new_segno * UndoLogSegmentSize)
+		{
+			char	discard_path[MAXPGPATH];
+
+			/*
+			 * Before removing the file, make sure that undofile_sync knows
+			 * that it might be missing.
+			 */
+			undofile_forgetsync(log->logno,
+								log->meta.tablespace,
+								pointer / UndoLogSegmentSize);
+
+			UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize,
+							   log->meta.tablespace, discard_path);
+
+			/* Can we recycle the oldest segment? */
+			if (recycle > 0)
+			{
+				char	recycle_path[MAXPGPATH];
+
+				/*
+				 * End points one byte past the end of the current undo space,
+				 * ie to the first byte of the segment file we want to create.
+				 */
+				UndoLogSegmentPath(logno, end / UndoLogSegmentSize,
+								   log->meta.tablespace, recycle_path);
+				if (rename(discard_path, recycle_path) == 0)
+				{
+					elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+					end += UndoLogSegmentSize;
+					--recycle;
+				}
+				else
+				{
+					elog(ERROR, "could not rename \"%s\" to \"%s\": %m",
+						 discard_path, recycle_path);
+				}
+			}
+			else
+			{
+				if (unlink(discard_path) == 0)
+					elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+				else
+					elog(ERROR, "could not unlink \"%s\": %m", discard_path);
+			}
+			pointer += UndoLogSegmentSize;
+		}
+	}
+
+	/* WAL log the discard. */
+	{
+		xl_undolog_discard xlrec;
+		XLogRecPtr ptr;
+
+		xlrec.logno = logno;
+		xlrec.discard = discard;
+		xlrec.end = end;
+		xlrec.latestxid = xid;
+		xlrec.entirely_discarded = entirely_discarded;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
+
+		if (need_to_flush_wal)
+			XLogFlush(ptr);
+	}
+
+	/* Update shmem to show the new discard and end pointers. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * Return an UndoRecPtr to the oldest valid data in an undo log, or
+ * InvalidUndoRecPtr if it is empty.
+ */
+UndoRecPtr
+UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full)
+{
+	UndoRecPtr	result;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (log->meta.discard == log->meta.insert)
+		result = InvalidUndoRecPtr;
+	else
+		result = MakeUndoRecPtr(log->logno, log->meta.discard);
+	*full = log->meta.status == UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Return the next insert location.  This also validates the input xid: if the
+ * latest insert point is not for the same transaction id, this returns an
+ * invalid undo pointer.
+ */
+UndoRecPtr
+UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId	logxid;
+	UndoRecPtr	insert;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert);
+}
+
+/*
+ * Get the address of the most recently inserted record.
+ */
+UndoRecPtr
+UndoLogGetLastRecordPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId logxid;
+	UndoRecPtr insert;
+	uint16 prevlen;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) &&
+		TransactionIdIsValid(xid) &&
+		!TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	if (prevlen == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert - prevlen);
+}
+
+/*
+ * Rewind the undo log insert position and also set the prevlen in the meta data.
+ */
+void
+UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen)
+{
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insert_urp);
+	UndoLogControl *log = get_undo_log(logno, false);
+	UndoLogOffset	insert = UndoRecPtrGetOffset(insert_urp);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = insert;
+	log->meta.prevlen = prevlen;
+
+	/*
+	 * Force WAL logging on the next undo allocation, so that during recovery
+	 * the undo insert location is consistent with normal allocation.
+	 */
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	/* WAL log the rewind. */
+	{
+		xl_undolog_rewind xlrec;
+
+		xlrec.logno = logno;
+		xlrec.insert = insert;
+		xlrec.prevlen = prevlen;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND);
+	}
+}
+
+/*
+ * Delete unreachable files under pg_undo.  Any files corresponding to LSN
+ * positions before the previous checkpoint are no longer needed.
+ */
+static void
+CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo)
+{
+	DIR	   *dir;
+	struct dirent *de;
+	char	path[MAXPGPATH];
+	char	oldest_path[MAXPGPATH];
+
+	/*
+	 * If a base backup is in progress, we can't delete any checkpoint
+	 * snapshot files because one of them corresponds to the backup label but
+	 * there could be any number of checkpoints during the backup.
+	 */
+	if (BackupInProgress())
+		return;
+
+	/* Otherwise keep only those >= the previous checkpoint's redo point. */
+	snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	dir = AllocateDir("pg_undo");
+	while ((de = ReadDir(dir, "pg_undo")) != NULL)
+	{
+		/*
+		 * Assume that fixed width uppercase hex strings sort the same way as
+		 * the values they represent, so we can use strcmp to identify undo
+		 * log snapshot files corresponding to checkpoints that we don't need
+		 * anymore.  This assumption holds for ASCII.
+		 */
+		if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH))
+			continue;
+
+		if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path))
+		{
+			snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name);
+			if (unlink(path) != 0)
+				elog(ERROR, "could not unlink file \"%s\": %m", path);
+		}
+	}
+	FreeDir(dir);
+}
+
+/*
+ * Write out the undo log meta data to the pg_undo directory.  The actual
+ * contents of undo logs is in shared buffers and therefore handled by
+ * CheckPointBuffers(), but here we record the table of undo logs and their
+ * properties.
+ */
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogMetaData *serialized = NULL;
+	size_t	serialized_size = 0;
+	char   *data;
+	char	path[MAXPGPATH];
+	int		num_logs;
+	int		fd;
+	int		i;
+	pg_crc32c crc;
+
+	/*
+	 * We acquire UndoLogLock to prevent any undo logs from being created or
+	 * discarded while we build a snapshot of them.  This isn't expected to
+	 * take long on a healthy system because the number of active logs should
+	 * be around the number of backends.  Holding this lock won't prevent
+	 * concurrent access to the undo log, except when segments need to be
+	 * added or removed.
+	 */
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+
+	/*
+	 * Rather than doing the file IO while we hold locks, we'll copy the
+	 * meta-data into a palloc'd buffer.
+	 */
+	serialized_size = sizeof(UndoLogMetaData) * UndoLogNumSlots();
+	serialized = (UndoLogMetaData *) palloc0(serialized_size);
+
+	/* Scan through all slots looking for non-empty ones. */
+	num_logs = 0;
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		UndoLogControl *slot = &shared->logs[i];
+
+		/* Skip empty slots. */
+		if (slot->logno == InvalidUndoLogNumber)
+			continue;
+
+		/* Capture snapshot while holding each mutex. */
+		LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+		serialized[num_logs++] = slot->meta;
+		slot->need_attach_wal_record = true; /* XXX: ?!? */
+		LWLockRelease(&slot->mutex);
+	}
+
+	LWLockRelease(UndoLogLock);
+
+	/* Dump into a file under pg_undo. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", path)));
+
+	/* Compute header checksum. */
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+	FIN_CRC32C(crc);
+
+	/* Write out the number of active logs + crc. */
+	if ((write(fd, &shared->low_logno, sizeof(shared->low_logno)) != sizeof(shared->low_logno)) ||
+		(write(fd, &shared->next_logno, sizeof(shared->next_logno)) != sizeof(shared->next_logno)) ||
+		(write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
+		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+	/* Write out the meta data for all active undo logs. */
+	data = (char *) serialized;
+	INIT_CRC32C(crc);
+	serialized_size = num_logs * sizeof(UndoLogMetaData);
+	while (serialized_size > 0)
+	{
+		ssize_t written;
+
+		written = write(fd, data, serialized_size);
+		if (written < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m", path)));
+		COMP_CRC32C(crc, data, written);
+		serialized_size -= written;
+		data += written;
+	}
+	FIN_CRC32C(crc);
+
+	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+
+	/* Flush file and directory entry. */
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
+	pg_fsync(fd);
+	CloseTransientFile(fd);
+	fsync_fname("pg_undo", true);
+	pgstat_report_wait_end();
+
+	if (serialized)
+		pfree(serialized);
+
+	CleanUpUndoCheckPointFiles(priorCheckPointRedo);
+	undolog_xid_map_gc();
+}
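/*
 * Editorial note (not part of the patch): the on-disk layout of a pg_undo
 * checkpoint snapshot file, as written by CheckPointUndoLogs() above and read
 * back by StartupUndoLogs() below:
 *
 *     UndoLogNumber   low_logno
 *     UndoLogNumber   next_logno
 *     int             num_logs
 *     pg_crc32c       crc of the three header fields
 *     UndoLogMetaData meta[num_logs]
 *     pg_crc32c       crc of the meta-data array
 */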
+
+void
+StartupUndoLogs(XLogRecPtr checkPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char	path[MAXPGPATH];
+	int		i;
+	int		fd;
+	int		nlogs;
+	pg_crc32c crc;
+	pg_crc32c new_crc;
+
+	/* If initdb is calling, there is no file to read yet. */
+	if (IsBootstrapProcessingMode())
+		return;
+
+	/* Open the pg_undo file corresponding to the given checkpoint. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
+
+	/* Read the active log number range. */
+	if ((read(fd, &shared->low_logno, sizeof(shared->low_logno))
+		 != sizeof(shared->low_logno)) ||
+		(read(fd, &shared->next_logno, sizeof(shared->next_logno))
+		 != sizeof(shared->next_logno)) ||
+		(read(fd, &nlogs, sizeof(nlogs)) != sizeof(nlogs)) ||
+		(read(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+
+	/* Verify the header checksum. */
+	INIT_CRC32C(new_crc);
+	COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(new_crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(new_crc, &nlogs, sizeof(nlogs));
+	FIN_CRC32C(new_crc);
+
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	/*
+	 * We'll acquire UndoLogLock just because allocate_undo_log() asserts we
+	 * hold it (we don't actually expect concurrent access yet).
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/* Initialize all the logs and set up the freelist. */
+	INIT_CRC32C(new_crc);
+	for (i = 0; i < nlogs; ++i)
+	{
+		ssize_t size;
+		UndoLogControl *log;
+
+		/*
+		 * Get a new slot to hold this UndoLogControl object.  If this
+		 * checkpoint was created on a system with a higher max_connections
+		 * setting, it's theoretically possible that we don't have enough
+		 * space and cannot start up.
+		 */
+		log = allocate_undo_log();
+		if (!log)
+			ereport(ERROR,
+					(errmsg("not enough undo log slots to recover from checkpoint: need at least %d, have %zu",
+							nlogs, UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections.")));
+
+		/* Read in the meta data for this undo log. */
+		if ((size = read(fd, &log->meta, sizeof(log->meta))) != sizeof(log->meta))
+			elog(ERROR, "short read of pg_undo meta data in file \"%s\": %m (got %zd, wanted %zu)",
+				 path, size, sizeof(log->meta));
+		COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta));
+
+		/*
+		 * At normal start-up, or during recovery, all active undo logs start
+		 * out on the appropriate free list.
+		 */
+		log->logno = log->meta.logno;
+		log->pid = InvalidPid;
+		log->xid = InvalidTransactionId;
+		if (log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+		{
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+		}
+	}
+	FIN_CRC32C(new_crc);
+
+	LWLockRelease(UndoLogLock);
+
+	/* Verify body checksum. */
+	if (read(fd, &crc, sizeof(crc)) != sizeof(crc))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	CloseTransientFile(fd);
+	pgstat_report_wait_end();
+}
+
+/*
+ * Return a pointer to a newly allocated UndoLogControl object in shared
+ * memory, or return NULL if there are no free slots.  The caller should
+ * acquire the mutex and set up the object.
+ */
+static UndoLogControl *
+allocate_undo_log(void)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMeInMode(UndoLogLock, LW_EXCLUSIVE));
+
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		log = &shared->logs[i];
+		if (log->logno == InvalidUndoLogNumber)
+		{
+			memset(&log->meta, 0, sizeof(log->meta));
+			log->next_free = InvalidUndoLogNumber;
+			/* TODO: oldest_xid etc? */
+			return log;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an UndoLogControl object in shared memory, so that it can be reused.
+ */
+static void
+free_undo_log(UndoLogControl *log)
+{
+	/*
+	 * When removing an undo log from a slot in shared memory, we acquire
+	 * UndoLogLock and log->mutex, so that other code can hold either lock to
+	 * prevent the object from disappearing.
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno != InvalidUndoLogNumber);
+	log->logno = InvalidUndoLogNumber;
+	memset(&log->meta, 0, sizeof(log->meta));
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * Get the UndoLogControl object for a given log number.
+ *
+ * The caller may or may not already hold UndoLogLock, and should indicate
+ * this by passing 'locked'.  We'll acquire it in the slow path if necessary.
+ * Either way, the caller must deal with the possibility that the returned
+ * UndoLogControl object no longer contains the requested logno by the time
+ * it is accessed.
+ *
+ * To do that, one of the following approaches must be taken by the calling
+ * code:
+ *
+ * 1.  If it is known that the calling backend is attached to the log, then it
+ * can be assumed that the UndoLogControl slot still holds the same undo log
+ * number.  The UndoLogControl slot can only be recycled with the cooperation
+ * of the backend that is attached to it (the undo log must first be marked as
+ * UNDO_LOG_STATUS_FULL, which happens when a backend detaches).  Calling
+ * code should probably assert that it is attached and the logno is as
+ * expected, however.
+ *
+ * 2.  Acquire log->mutex before accessing any members, and after doing so,
+ * check that the logno is as expected.  If it is not, the entire undo log
+ * must be assumed to be discarded and the caller must behave accordingly.
+ *
+ * Return NULL if the undo log has been entirely discarded.  It is an error to
+ * ask for undo logs that have never been created.
+ */
+static UndoLogControl *
+get_undo_log(UndoLogNumber logno, bool locked)
+{
+	UndoLogControl *result = NULL;
+	UndoLogTableEntry *entry;
+	bool	   found;
+
+	Assert(locked == LWLockHeldByMe(UndoLogLock));
+
+	/* First see if we already have it in our cache. */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		result = entry->control;
+	else
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		/* Nope.  Linear search for the slot in shared memory. */
+		if (!locked)
+			LWLockAcquire(UndoLogLock, LW_SHARED);
+		for (i = 0; i < UndoLogNumSlots(); ++i)
+		{
+			if (shared->logs[i].logno == logno)
+			{
+				/* Found it. */
+
+				/*
+				 * TODO: Should this function be usable in a critical section?
+				 * Would it make sense to detect that we are in a critical
+				 * section and just return the pointer to the log without
+				 * updating the cache, to avoid any chance of allocating
+				 * memory?
+				 */
+
+				entry = undologtable_insert(undologtable_cache, logno, &found);
+				entry->number = logno;
+				entry->control = &shared->logs[i];
+				entry->tablespace = entry->control->meta.tablespace;
+				result = entry->control;
+				break;
+			}
+		}
+
+		/*
+		 * If we didn't find it, then it must already have been entirely
+		 * discarded.  We create a negative cache entry so that we can answer
+		 * this question quickly next time.
+		 *
+		 * TODO: We could track the lowest known undo log number, to reduce
+		 * the negative cache entry bloat.
+		 */
+		if (result == NULL)
+		{
+			/*
+			 * Sanity check: the caller should not be asking about undo logs
+			 * that have never existed.
+			 */
+			if (logno >= shared->next_logno)
+				elog(PANIC, "undo log %u hasn't been created yet", logno);
+			entry = undologtable_insert(undologtable_cache, logno, &found);
+			entry->number = logno;
+			entry->control = NULL;
+			entry->tablespace = 0;
+		}
+		if (!locked)
+			LWLockRelease(UndoLogLock);
+	}
+
+	return result;
+}
+
+/*
+ * Get a pointer to an UndoLogControl object corresponding to a given logno.
+ *
+ * In general, the caller must acquire the UndoLogControl's mutex to access
+ * the contents, and at that time must consider that the logno might have
+ * changed because the undo log it contained has been entirely discarded.
+ *
+ * If the calling backend is currently attached to the undo log, that is not
+ * possible, because logs can only reach UNDO_LOG_STATUS_DISCARDED after first
+ * reaching UNDO_LOG_STATUS_FULL, and that only happens while detaching.
+ */
+UndoLogControl *
+UndoLogGet(UndoLogNumber logno, bool missing_ok)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	if (log == NULL && !missing_ok)
+		elog(ERROR, "unknown undo log number %d", logno);
+
+	return log;
+}
+
+/*
+ * Attach to an undo log, possibly creating or recycling one as required.
+ */
+static void
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = NULL;
+	UndoLogNumber logno;
+	UndoLogNumber *place;
+
+	Assert(!InRecovery);
+	Assert(MyUndoLogState.logs[persistence] == NULL);
+
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/*
+	 * For now we have a simple linked list of unattached undo logs for each
+	 * persistence level.  We'll grovel through it to find something for the
+	 * tablespace you asked for.  If you're not using multiple tablespaces,
+	 * we'll be able to pop one off the front.  We might need a hash table
+	 * keyed by tablespace if this simple scheme turns out to be too slow when
+	 * using many tablespaces and many undo logs, but that seems like an
+	 * unusual use case not worth optimizing for.
+	 */
+	place = &shared->free_lists[persistence];
+	while (*place != InvalidUndoLogNumber)
+	{
+		UndoLogControl *candidate = get_undo_log(*place, true);
+
+		/*
+		 * There should never be an undo log on the freelist that has been
+		 * entirely discarded, or hasn't been created yet.  The persistence
+		 * level should match the freelist.
+		 */
+		if (unlikely(candidate == NULL))
+			elog(ERROR,
+				 "corrupted undo log freelist, no such undo log %u", *place);
+		if (unlikely(candidate->meta.persistence != persistence))
+			elog(ERROR,
+				 "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
+				 *place, candidate->meta.persistence, persistence);
+
+		if (candidate->meta.tablespace == tablespace)
+		{
+			logno = *place;
+			log = candidate;
+			*place = candidate->next_free;
+			break;
+		}
+		place = &candidate->next_free;
+	}
+
+	/*
+	 * All existing undo logs for this tablespace and persistence level are
+	 * busy, so we'll have to create a new one.
+	 */
+	if (log == NULL)
+	{
+		if (shared->next_logno > MaxUndoLogNumber)
+		{
+			/*
+			 * You've used up all 16 exabytes of undo log addressing space.
+			 * This is a difficult state to reach using only 16 exabytes of
+			 * WAL.
+			 */
+			elog(ERROR, "undo log address space exhausted");
+		}
+
+		/* Allocate a slot from the UndoLogControl pool. */
+		log = allocate_undo_log();
+		if (unlikely(!log))
+			ereport(ERROR,
+					(errmsg("could not create new undo log"),
+					 errdetail("The maximum number of active undo logs is %zu.",
+							   UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections.")));
+		log->logno = logno = shared->next_logno;
+
+		/*
+		 * The insert and discard pointers start after the first block's
+		 * header.  XXX That means that insert is > end for a short time in a
+		 * newly created undo log.  Is there any problem with that?
+		 */
+		log->meta.insert = UndoLogBlockHeaderSize;
+		log->meta.discard = UndoLogBlockHeaderSize;
+
+		log->meta.logno = logno;
+		log->meta.tablespace = tablespace;
+		log->meta.persistence = persistence;
+		log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+
+		/* Move the high log number pointer past this one. */
+		++shared->next_logno;
+
+		/* WAL-log the creation of this new undo log. */
+		{
+			xl_undolog_create xlrec;
+
+			xlrec.logno = logno;
+			xlrec.tablespace = log->meta.tablespace;
+			xlrec.persistence = log->meta.persistence;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+		}
+
+		/*
+		 * This undo log has no segments.  UndoLogAllocate will create the
+		 * first one on demand.
+		 */
+	}
+	LWLockRelease(UndoLogLock);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = MyProcPid;
+	log->xid = InvalidTransactionId;
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	MyUndoLogState.logs[persistence] = log;
+}
+
+/*
+ * Free chunks of the xid/undo log map that relate to transactions that are no
+ * longer running.  This is run at each checkpoint.
+ */
+static void
+undolog_xid_map_gc(void)
+{
+	UndoLogNumber **xid_map = MyUndoLogState.xid_map;
+	TransactionId oldest_xid;
+	uint16 new_oldest_chunk;
+	uint16 oldest_chunk;
+
+	if (xid_map == NULL)
+		return;
+
+	/*
+	 * During crash recovery, it may not be possible to call GetOldestXmin()
+	 * yet because latestCompletedXid is invalid.
+	 */
+	if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid))
+		return;
+
+	oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	new_oldest_chunk = UndoLogGetXidHigh(oldest_xid);
+	oldest_chunk = MyUndoLogState.xid_map_oldest_chunk;
+
+	while (oldest_chunk != new_oldest_chunk)
+	{
+		if (xid_map[oldest_chunk])
+		{
+			pfree(xid_map[oldest_chunk]);
+			xid_map[oldest_chunk] = NULL;
+		}
+		oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits);
+	}
+	MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk;
+}
+
+/*
+ * Associate a xid with an undo log, during recovery.  In a primary server,
+ * this isn't necessary because backends know which undo log they're attached
+ * to.  During recovery, the natural association between backends and xids is
+ * lost, so we need to manage that explicitly.
+ */
+static void
+undolog_xid_map_add(TransactionId xid, UndoLogNumber logno)
+{
+	uint16		high_bits;
+	uint16		low_bits;
+
+	high_bits = UndoLogGetXidHigh(xid);
+	low_bits = UndoLogGetXidLow(xid);
+
+	if (unlikely(MyUndoLogState.xid_map == NULL))
+	{
+		/* First time through.  Create mapping array. */
+		MyUndoLogState.xid_map =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber *) *
+								   (1 << (32 - UndoLogXidLowBits)));
+		MyUndoLogState.xid_map_oldest_chunk = high_bits;
+	}
+
+	if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL))
+	{
+		/* This bank of mappings doesn't exist yet.  Create it. */
+		MyUndoLogState.xid_map[high_bits] =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber) *
+								   (1 << UndoLogXidLowBits));
+	}
+
+	/* Associate this xid with this undo log number. */
+	MyUndoLogState.xid_map[high_bits][low_bits] = logno;
+}
+
+/* check_hook: validate new undo_tablespaces */
+bool
+check_undo_tablespaces(char **newval, void **extra, GucSource source)
+{
+	char	   *rawname;
+	List	   *namelist;
+
+	/* Need a modifiable copy of string */
+	rawname = pstrdup(*newval);
+
+	/*
+	 * Parse string into list of identifiers, just to check for
+	 * well-formedness (unfortunately we can't validate the names in the
+	 * catalog yet).
+	 */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+	{
+		/* syntax error in name list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawname);
+		list_free(namelist);
+		return false;
+	}
+
+	/*
+	 * Make sure we aren't already in a transaction that has been assigned an
+	 * XID.  This ensures we don't detach from an undo log that we might have
+	 * started writing undo data into for this transaction.
+	 */
+	if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+	list_free(namelist);
+
+	return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+	/*
+	 * This is normally called only when GetTopTransactionIdIfAny() ==
+	 * InvalidTransactionId (because you can't change undo_tablespaces in the
+	 * middle of a transaction that's been assigned an xid), but we can't
+	 * assert that because it's also called at the end of a transaction that's
+	 * rolling back, to reset the GUC if it was set inside the transaction.
+	 */
+
+	/* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+	MyUndoLogState.need_to_choose_tablespace = true;
+}
+
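+/*
+ * Work out which tablespace new undo data should be placed in, based on the
+ * undo_tablespaces GUC.  Returns true if the caller must later release
+ * TablespaceCreateLock, which we acquire here while resolving a non-default
+ * tablespace name so that it cannot be dropped underneath us.
+ */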
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+	char   *rawname;
+	List   *namelist;
+	bool	need_to_unlock;
+	int		length;
+	int		i;
+
+	/* We need a modifiable copy of string. */
+	rawname = pstrdup(undo_tablespaces);
+
+	/* Break string into list of identifiers. */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+		elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+	length = list_length(namelist);
+	if (length == 0 ||
+		(length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+	{
+		/*
+		 * If it's an empty string, then we'll use the default tablespace.  No
+		 * locking is required because it can't be dropped.
+		 */
+		*tablespace = DEFAULTTABLESPACE_OID;
+		need_to_unlock = false;
+	}
+	else
+	{
+		/*
+		 * Choose an OID using our pid, so that if several backends have the
+		 * same multi-tablespace setting they'll spread out.  We could easily
+		 * do better than this if more serious load balancing is judged
+		 * useful.
+		 */
+		int		index = MyProcPid % length;
+		int		first_index = index;
+		Oid		oid = InvalidOid;
+
+		/*
+		 * Take the tablespace create/drop lock while we look the name up.
+		 * This prevents the tablespace from being dropped while we're trying
+		 * to resolve the name, or while the caller is trying to create an
+		 * undo log in it.  The caller will have to release this lock.
+		 */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		for (;;)
+		{
+			const char *name = list_nth(namelist, index);
+
+			oid = get_tablespace_oid(name, true);
+			if (oid == InvalidOid)
+			{
+				/* Unknown tablespace, try the next one. */
+				index = (index + 1) % length;
+				/*
+				 * But if we've tried them all, it's time to complain.  We'll
+				 * arbitrarily complain about the last one we tried in the
+				 * error message.
+				 */
+				if (index == first_index)
+					ereport(ERROR,
+							(errcode(ERRCODE_UNDEFINED_OBJECT),
+							 errmsg("tablespace \"%s\" does not exist", name),
+							 errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+				continue;
+			}
+			if (oid == GLOBALTABLESPACE_OID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("undo logs cannot be placed in pg_global tablespace")));
+			/* If we got here we succeeded in finding one. */
+			break;
+		}
+
+		Assert(oid != InvalidOid);
+		*tablespace = oid;
+		need_to_unlock = true;
+	}
+
+	/*
+	 * If we came here because the user changed undo_tablespaces, then detach
+	 * from any undo logs we happen to be attached to.
+	 */
+	if (force_detach)
+	{
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+		{
+			UndoLogControl *log = MyUndoLogState.logs[i];
+			UndoLogSharedData *shared = MyUndoLogState.shared;
+
+			if (log != NULL)
+			{
+				LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+				log->pid = InvalidPid;
+				log->xid = InvalidTransactionId;
+				LWLockRelease(&log->mutex);
+
+				LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+				log->next_free = shared->free_lists[i];
+				shared->free_lists[i] = log->logno;
+				LWLockRelease(UndoLogLock);
+
+				MyUndoLogState.logs[i] = NULL;
+			}
+		}
+	}
+
+	return need_to_unlock;
+}
+
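+/*
+ * Try to drop all undo logs in the given tablespace, for DROP TABLESPACE.
+ * Returns false without dropping anything if any undo log in the tablespace
+ * still contains undiscarded data or is in use by an in-progress
+ * transaction.
+ */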
+bool
+DropUndoLogsInTablespace(Oid tablespace)
+{
+	DIR *dir;
+	char undo_path[MAXPGPATH];
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMe(TablespaceCreateLock));
+	Assert(tablespace != DEFAULTTABLESPACE_OID);
+
+	/* First, try to kick everyone off any undo logs in this tablespace. */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		bool ok;
+		bool return_to_freelist = false;
+
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/* Check if this undo log can be forcibly detached. */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		if (log->meta.discard == log->meta.insert &&
+			(log->xid == InvalidTransactionId ||
+			 !TransactionIdIsInProgress(log->xid)))
+		{
+			log->xid = InvalidTransactionId;
+			if (log->pid != InvalidPid)
+			{
+				log->pid = InvalidPid;
+				return_to_freelist = true;
+			}
+			ok = true;
+		}
+		else
+		{
+			/*
+			 * There is data we need in this undo log.  We can't force it to
+			 * be detached.
+			 */
+			ok = false;
+		}
+		LWLockRelease(&log->mutex);
+
+		/* If we failed, then give up now and report failure. */
+		if (!ok)
+			return false;
+
+		/*
+		 * Put this undo log back on the appropriate free-list.  No one can
+		 * attach to it while we hold TablespaceCreateLock, but if we return
+		 * early on a later iteration of this loop, we need the undo log to
+		 * remain usable.  We'll remove all appropriate logs from the
+		 * free-lists in a separate step below.
+		 */
+		if (return_to_freelist)
+		{
+			LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+			LWLockRelease(UndoLogLock);
+		}
+	}
+
+	/*
+	 * We detached all backends from undo logs in this tablespace, and no one
+	 * can attach to any non-default-tablespace undo logs while we hold
+	 * TablespaceCreateLock.  We can now drop the undo logs.
+	 */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/*
+		 * Make sure no buffers remain.  When that is done by UndoDiscard(),
+		 * the final page is left in shared_buffers because it may contain
+		 * data, or at least be needed again very soon.  Here we need to drop
+		 * even that page from the buffer pool.
+		 */
+		forget_undo_buffers(log->logno, log->meta.discard, log->meta.discard, true);
+
+		/*
+		 * TODO: For now we drop the undo log, meaning that it will never be
+		 * used again.  That wastes the rest of its address space.  Instead,
+		 * we should put it onto a special list of 'offline' undo logs, ready
+		 * to be reactivated in some other tablespace.  Then we can keep the
+		 * unused portion of its address space.
+		 */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		log->meta.status = UNDO_LOG_STATUS_DISCARDED;
+		LWLockRelease(&log->mutex);
+	}
+
+	/* Unlink all undo segment files in this tablespace. */
+	UndoLogDirectory(tablespace, undo_path);
+
+	dir = AllocateDir(undo_path);
+	if (dir != NULL)
+	{
+		struct dirent *de;
+
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 ||
+				strcmp(de->d_name, "..") == 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			if (unlink(segment_path) < 0)
+				elog(LOG, "could not unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+	}
+
+	/* Remove all dropped undo logs from the free-lists. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		UndoLogControl *log;
+		UndoLogNumber *place;
+
+		place = &shared->free_lists[i];
+		while (*place != InvalidUndoLogNumber)
+		{
+			log = get_undo_log(*place, true);
+			if (!log)
+				elog(ERROR,
+					 "corrupted undo log freelist, unknown log %u", *place);
+			if (log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+				*place = log->next_free;
+			else
+				place = &log->next_free;
+		}
+	}
+	LWLockRelease(UndoLogLock);
+
+	return true;
+}
+
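+/*
+ * Unlink the segment files of all undo logs at the given persistence level
+ * and reset their insert and discard pointers, discarding any data they
+ * contained.
+ */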
+void
+ResetUndoLogs(UndoPersistence persistence)
+{
+	UndoLogControl *log;
+
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		DIR	   *dir;
+		struct dirent *de;
+		char	undo_path[MAXPGPATH];
+		char	segment_prefix[MAXPGPATH];
+		size_t	segment_prefix_size;
+
+		if (log->meta.persistence != persistence)
+			continue;
+
+		/* Scan the directory for files belonging to this undo log. */
+		snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", log->logno);
+		segment_prefix_size = strlen(segment_prefix);
+		UndoLogDirectory(log->meta.tablespace, undo_path);
+		dir = AllocateDir(undo_path);
+		if (dir == NULL)
+			continue;
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */
+			if (unlink(segment_path) < 0)
+				elog(LOG, "could not unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+
+		/*
+		 * We have no segment files.  Set the pointers to indicate that there
+		 * is no data.  The discard and insert pointers point to the first
+		 * usable byte in the segment we will create when we next try to
+		 * allocate.  This is a bit strange, because it means that they are
+		 * past the end pointer.  That's the same as when new undo logs are
+		 * created.
+		 *
+		 * TODO: Should we rewind to zero instead, so we can reuse that (now)
+		 * unreferenced address space?
+		 */
+		log->meta.insert = log->meta.discard = log->meta.end +
+			UndoLogBlockHeaderSize;
+	}
+}
+
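+/*
+ * Set-returning SQL function backing the pg_stat_undo_logs view.  Produces
+ * one row per active undo log, copying each log's meta-data while holding
+ * its mutex.
+ */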
+Datum
+pg_stat_get_undo_logs(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_UNDO_LOGS_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char *tablespace_name = NULL;
+	Oid last_tablespace = InvalidOid;
+	int			i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Scan all undo logs to build the results. */
+	for (i = 0; i < shared->array_size; ++i)
+	{
+		UndoLogControl *log = &shared->logs[i];
+		char buffer[17];
+		Datum values[PG_STAT_GET_UNDO_LOGS_COLS];
+		bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false };
+		Oid tablespace;
+
+		if (log == NULL)
+			continue;
+
+		/*
+		 * This won't be a consistent result overall, but the values for each
+		 * log will be consistent because we'll take the per-log lock while
+		 * copying them.
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+
+		/* Skip unused slots and entirely discarded undo logs. */
+		if (log->logno == InvalidUndoLogNumber ||
+			log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+		{
+			LWLockRelease(&log->mutex);
+			continue;
+		}
+
+		values[0] = ObjectIdGetDatum((Oid) log->logno);
+		values[1] = CStringGetTextDatum(
+			log->meta.persistence == UNDO_PERMANENT ? "permanent" :
+			log->meta.persistence == UNDO_UNLOGGED ? "unlogged" :
+			log->meta.persistence == UNDO_TEMP ? "temporary" : "<unknown>");
+		tablespace = log->meta.tablespace;
+
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.discard));
+		values[3] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.insert));
+		values[4] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.end));
+		values[5] = CStringGetTextDatum(buffer);
+		if (log->xid == InvalidTransactionId)
+			nulls[6] = true;
+		else
+			values[6] = TransactionIdGetDatum(log->xid);
+		if (log->pid == InvalidPid)
+			nulls[7] = true;
+		else
+			values[7] = Int32GetDatum((int32) log->pid);
+		switch (log->meta.status)
+		{
+		case UNDO_LOG_STATUS_ACTIVE:
+			values[8] = CStringGetTextDatum("ACTIVE"); break;
+		case UNDO_LOG_STATUS_FULL:
+			values[8] = CStringGetTextDatum("FULL"); break;
+		default:
+			nulls[8] = true;
+		}
+		LWLockRelease(&log->mutex);
+
+		/*
+		 * Deal with potentially slow tablespace name lookup without the lock.
+		 * Avoid making multiple calls to that expensive function for the
+		 * common case of repeating tablespace.
+		 */
+		if (tablespace != last_tablespace)
+		{
+			if (tablespace_name)
+				pfree(tablespace_name);
+			tablespace_name = get_tablespace_name(tablespace);
+			last_tablespace = tablespace;
+		}
+		if (tablespace_name)
+		{
+			values[2] = CStringGetTextDatum(tablespace_name);
+			nulls[2] = false;
+		}
+		else
+			nulls[2] = true;
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	if (tablespace_name)
+		pfree(tablespace_name);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * replay the creation of a new undo log
+ */
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+	xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	/* Create meta-data space in shared memory. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	/* TODO: assert that it doesn't exist already? */
+	log = allocate_undo_log();
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->logno = xlrec->logno;
+	log->meta.logno = xlrec->logno;
+	log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+	log->meta.persistence = xlrec->persistence;
+	log->meta.tablespace = xlrec->tablespace;
+	log->meta.insert = UndoLogBlockHeaderSize;
+	log->meta.discard = UndoLogBlockHeaderSize;
+	shared->next_logno = Max(xlrec->logno + 1, shared->next_logno);
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * replay the addition of a new segment to an undo log
+ */
+static void
+undolog_xlog_extend(XLogReaderState *record)
+{
+	xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record);
+
+	/* Extend exactly as we would during DO phase. */
+	extend_undo_log(xlrec->logno, xlrec->end);
+}
+
+/*
+ * replay the association of an xid with a specific undo log
+ */
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+	xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+	/* Restore current dbid */
+	MyUndoLogState.dbid = xlrec->dbid;
+
+	/*
+	 * Whatever follows is the first record for this transaction.  Zheap will
+	 * use this to add UREC_INFO_TRANSACTION.
+	 */
+	log = get_undo_log(xlrec->logno, false);
+	/* TODO */
+	log->meta.is_first_rec = true;
+	log->xid = xlrec->xid;
+}
+
+/*
+ * replay an undo-log switch WAL record.  Store the transaction's undo record
+ * pointer from the previous log in MyUndoLogState temporarily; it is cleared
+ * the first time it is read.
+ */
+static void
+undolog_xlog_switch(XLogReaderState *record)
+{
+	UndoRecPtr prevlogurp = *((UndoRecPtr *) XLogRecGetData(record));
+
+	MyUndoLogState.prevlogurp = prevlogurp;
+}
+
+/*
+ * Drop all buffers for the given undo log, from old_discard up to
+ * new_discard.  If drop_tail is true, also drop the buffer that holds
+ * new_discard; this is used when discarding undo logs completely, for example
+ * via DROP TABLESPACE.  If it is false, the final buffer is not dropped
+ * because it may still contain data.
+ */
+static void
+forget_undo_buffers(int logno, UndoLogOffset old_discard,
+					UndoLogOffset new_discard, bool drop_tail)
+{
+	BlockNumber old_blockno;
+	BlockNumber new_blockno;
+	RelFileNode	rnode;
+
+	UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard));
+	old_blockno = old_discard / BLCKSZ;
+	new_blockno = new_discard / BLCKSZ;
+	if (drop_tail)
+		++new_blockno;
+	while (old_blockno < new_blockno)
+		ForgetBuffer(rnode, UndoLogForkNum, old_blockno++);
+}
+
+/*
+ * replay an undo segment discard record
+ */
+static void
+undolog_xlog_discard(XLogReaderState *record)
+{
+	xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	UndoLogOffset old_segment_begin;
+	UndoLogOffset new_segment_begin;
+	RelFileNode rnode = {0};
+	char	dir[MAXPGPATH];
+
+	log = get_undo_log(xlrec->logno, false);
+	if (log == NULL)
+		elog(ERROR, "unknown undo log %d", xlrec->logno);
+
+	/*
+	 * We're about to discard undo logs.  In hot standby mode, ensure that
+	 * there are no queries running that still need to fetch tuples from the
+	 * discarded undo.
+	 *
+	 * XXX We pass an empty rnode to the conflict function so that it checks
+	 * for conflicts in all backends, regardless of which database each
+	 * backend is connected to.
+	 */
+	if (InHotStandby && TransactionIdIsValid(xlrec->latestxid))
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode);
+
+	/*
+	 * See if we need to unlink or rename any files, but don't consider it an
+	 * error if we find that files are missing.  Since UndoLogDiscard()
+	 * performs filesystem operations before WAL logging or updating shmem
+	 * which could be checkpointed, a crash could have left files already
+	 * deleted, but we could replay WAL that expects the files to be there.
+	 */
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno == xlrec->logno);
+	discard = log->meta.discard;
+	end = log->meta.end;
+	LWLockRelease(&log->mutex);
+
+	/* Drop buffers before we remove/recycle any files. */
+	forget_undo_buffers(xlrec->logno, discard, xlrec->discard,
+						xlrec->entirely_discarded);
+
+	/* Rewind to the start of the segment. */
+	old_segment_begin = discard - discard % UndoLogSegmentSize;
+	new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize;
+
+	/* Unlink or rename segments that are no longer in range. */
+	while (old_segment_begin < new_segment_begin)
+	{
+		char	discard_path[MAXPGPATH];
+
+		/*
+		 * Before removing the file, make sure that undofile_sync knows that
+		 * it might be missing.
+		 */
+		undofile_forgetsync(log->logno,
+							log->meta.tablespace,
+							old_segment_begin / UndoLogSegmentSize);
+
+		UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize,
+						   log->meta.tablespace, discard_path);
+
+		/* Can we recycle the oldest segment? */
+		if (end < xlrec->end)
+		{
+			char	recycle_path[MAXPGPATH];
+
+			UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize,
+							   log->meta.tablespace, recycle_path);
+			if (rename(discard_path, recycle_path) == 0)
+			{
+				elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+				end += UndoLogSegmentSize;
+			}
+			else
+			{
+				elog(LOG, "could not rename \"%s\" to \"%s\": %m",
+					 discard_path, recycle_path);
+			}
+		}
+		else
+		{
+			if (unlink(discard_path) == 0)
+				elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+			else
+				elog(LOG, "could not unlink \"%s\": %m", discard_path);
+		}
+		old_segment_begin += UndoLogSegmentSize;
+	}
+
+	/* Create any further new segments that are needed, the slow way. */
+	while (end < xlrec->end)
+	{
+		allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	/* Flush the directory entries. */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/* Update shmem. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = xlrec->discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (xlrec->entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * replay the rewind of an undo log
+ */
+static void
+undolog_xlog_rewind(XLogReaderState *record)
+{
+	xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	log = get_undo_log(xlrec->logno, false);
+	log->meta.insert = xlrec->insert;
+	log->meta.prevlen = xlrec->prevlen;
+}
+
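+/*
+ * Redo handler for the undo log resource manager: dispatch to the
+ * record-type-specific replay functions above.
+ */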
+void
+undolog_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			undolog_xlog_create(record);
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			undolog_xlog_extend(record);
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			undolog_xlog_attach(record);
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			undolog_xlog_discard(record);
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			undolog_xlog_rewind(record);
+			break;
+		case XLOG_UNDOLOG_SWITCH:
+			undolog_xlog_switch(record);
+			break;
+		default:
+			elog(PANIC, "undo_redo: unknown op code %u", info);
+	}
+}
+
+/*
+ * For assertions only.
+ */
+bool
+AmAttachedToUndoLog(UndoLogControl *log)
+{
+	/*
+	 * In general, we can't access log's members without locking.  But this
+	 * function is intended only for asserting that you are attached, and
+	 * while you're attached the slot can't be recycled, so don't bother
+	 * locking.
+	 */
+	return MyUndoLogState.logs[log->meta.persistence] == log;
+}
+
+/*
+ * For testing use only.  This function is only used by the test_undo module.
+ */
+void
+UndoLogDetachFull(void)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+		if (MyUndoLogState.logs[i])
+			detach_current_undo_log(i, true);
+}
+
+/*
+ * Fetch database id from the undo log state
+ */
+Oid
+UndoLogStateGetDatabaseId(void)
+{
+	Assert(InRecovery);
+	return MyUndoLogState.dbid;
+}
+
+/*
+ * Get the transaction's start header location in the previous log.
+ *
+ * This should only be called during recovery.  The value of prevlogurp is
+ * stored in MyUndoLogState while replaying an XLOG_UNDOLOG_SWITCH record,
+ * and is cleared by this function.
+ */
+UndoRecPtr
+UndoLogStateGetAndClearPrevLogXactUrp(void)
+{
+	UndoRecPtr	prevlogurp;
+
+	Assert(InRecovery);
+	prevlogurp = MyUndoLogState.prevlogurp;
+	MyUndoLogState.prevlogurp = InvalidUndoRecPtr;
+
+	return prevlogurp;
+}
+
+/*
+ * Get the undo log number my backend is attached to
+ */
+UndoLogNumber
+UndoLogAmAttachedTo(UndoPersistence persistence)
+{
+	if (MyUndoLogState.logs[persistence] == NULL)
+		return InvalidUndoLogNumber;
+	return MyUndoLogState.logs[persistence]->logno;
+}
+
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 8630542..8f83ce1 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -940,6 +940,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio
     ON pg_subscription TO public;
 
 
+CREATE VIEW pg_stat_undo_logs AS
+    SELECT *
+    FROM pg_stat_get_undo_logs();
+
 --
 -- We have a few function definitions in here, too.
 -- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 4a714f6..281c1e9 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -54,6 +54,7 @@
 #include "access/reloptions.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -488,6 +489,20 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
 
 	/*
+	 * Drop the undo logs in this tablespace.  This will fail (without
+	 * dropping anything) if there are undo logs that we can't afford to drop
+	 * because they contain non-discarded data or a transaction is in
+	 * progress.  Since we hold TablespaceCreateLock, no other session will be
+	 * able to attach to an undo log in this tablespace (or any tablespace
+	 * except default) concurrently.
+	 */
+	if (!DropUndoLogsInTablespace(tablespaceoid))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
+						tablespacename)));
+
+	/*
 	 * Try to remove the physical infrastructure.
 	 */
 	if (!destroy_tablespace_directories(tablespaceoid, false))
@@ -1487,6 +1502,14 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		/* This shouldn't be able to fail in recovery. */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		if (!DropUndoLogsInTablespace(xlrec->ts_id))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("tablespace cannot be dropped because it contains non-empty undo logs")));
+		LWLockRelease(TablespaceCreateLock);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index e3b0565..1a7a381 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -154,6 +154,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_UNDOLOG_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 0c86a58..4725cbe 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -127,6 +128,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, UndoLogShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
@@ -219,6 +221,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	UndoLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index a6fda81..b6c0b00 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,8 @@ RegisterLWLockTranches(void)
 	LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+	LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log");
+	LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard");
 
 	/* Register named tranches. */
 	for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index e6025ec..554af46 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -50,3 +50,4 @@ OldSnapshotTimeMapLock				42
 BackendRandomLock					43
 LogicalRepWorkerLock				44
 CLogTruncationLock					45
+UndoLogLock							46
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index b636b1e..fcc7a86 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -556,6 +556,7 @@ BaseInit(void)
 	InitFileAccess();
 	smgrinit();
 	InitBufferPoolAccess();
+	UndoLogInit();
 }
 
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 6fe1939..b188657 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -119,6 +119,7 @@ extern int	CommitDelay;
 extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
+extern char *undo_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
 
@@ -3534,6 +3535,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Sets the tablespace(s) to use for undo logs."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE
+		},
+		&undo_tablespaces,
+		"",
+		check_undo_tablespaces, assign_undo_tablespaces, NULL
+	},
+
+	{
 		{"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER,
 			gettext_noop("Sets the path for dynamically loadable modules."),
 			gettext_noop("If a dynamically loadable module needs to be opened and "
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 211a963..ea02210 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -209,11 +209,13 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_undo",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
+	"base/undo",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..938150d 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
 #include "access/nbtxlog.h"
 #include "access/rmgr.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 0bbe9879..9c6fca4 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 0000000..8a7e1e4
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,398 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ *
+ * PostgreSQL undo log manager.  This module is responsible for lifecycle
+ * management of undo logs and backing files, associating undo logs with
+ * backends, allocating and managing space within undo logs.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/xlogreader.h"
+#include "catalog/pg_class.h"
+#include "common/relpath.h"
+#include "storage/bufpage.h"
+
+#ifndef FRONTEND
+#include "storage/lwlock.h"
+#endif
+
+/* The type used to identify an undo log and position within it. */
+typedef uint64 UndoRecPtr;
+
+/* The type used for undo record lengths. */
+typedef uint16 UndoRecordSize;
+
+/* Undo log statuses. */
+typedef enum
+{
+	UNDO_LOG_STATUS_UNUSED = 0,
+	UNDO_LOG_STATUS_ACTIVE,
+	UNDO_LOG_STATUS_FULL,
+	UNDO_LOG_STATUS_DISCARDED
+} UndoLogStatus;
+
+/*
+ * Undo log persistence levels.  These have a one-to-one correspondence with
+ * relpersistence values, but are small integers so that we can use them as an
+ * index into the "logs" and "lognos" arrays.
+ */
+typedef enum
+{
+	UNDO_PERMANENT = 0,
+	UNDO_UNLOGGED = 1,
+	UNDO_TEMP = 2
+} UndoPersistence;
+
+#define UndoPersistenceLevels 3
+
+/*
+ * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence
+ * enumerator.
+ */
+#define UndoPersistenceForRelPersistence(rp)						\
+	((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT :			\
+	 (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP)
+
+/*
+ * Convert from UndoPersistence to a relpersistence value.
+ */
+#define RelPersistenceForUndoPersistence(up)				\
+	((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT :	\
+	 (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED :		\
+	 RELPERSISTENCE_TEMP)
+
+/*
+ * Get the appropriate UndoPersistence value from a Relation.
+ */
+#define UndoPersistenceForRelation(rel)									\
+	(UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence))
+
+/* Type for offsets within undo logs */
+typedef uint64 UndoLogOffset;
+
+/* printf-family format string for UndoRecPtr. */
+#define UndoRecPtrFormat "%016" INT64_MODIFIER "X"
+
+/* printf-family format string for UndoLogOffset. */
+#define UndoLogOffsetFormat UINT64_FORMAT
+
+/* Number of blocks of BLCKSZ in an undo log segment file.  128 = 1MB. */
+#define UNDOSEG_SIZE 128
+
+/* Size of an undo log segment file in bytes. */
+#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE)
+
+/* The width of an undo log number in bits.  24 allows for 16.7m logs. */
+#define UndoLogNumberBits 24
+
+/* The maximum valid undo log number. */
+#define MaxUndoLogNumber ((1 << UndoLogNumberBits) - 1)
+
+/* The width of an undo log offset in bits.  40 allows for 1TB per log.*/
+#define UndoLogOffsetBits (64 - UndoLogNumberBits)
+
+/* Special value for undo record pointer which indicates that it is invalid. */
+#define	InvalidUndoRecPtr	((UndoRecPtr) 0)
+
+/* End-of-list value when building linked lists of undo logs. */
+#define InvalidUndoLogNumber -1
+
+/*
+ * This undo record pointer is used in the transaction header.  This special
+ * value indicates that we don't yet know the start point of the next
+ * transaction; it will be updated with a valid value later.
+ */
+#define SpecialUndoRecPtr	((UndoRecPtr) 0xFFFFFFFFFFFFFFFF)
+
+/*
+ * The maximum amount of data that can be stored in an undo log.  Can be set
+ * artificially low to test full log behavior.
+ */
+#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits)
+
+/* Type for numbering undo logs. */
+typedef int UndoLogNumber;
+
+/* Extract the undo log number from an UndoRecPtr. */
+#define UndoRecPtrGetLogNo(urp)					\
+	((urp) >> UndoLogOffsetBits)
+
+/* Extract the offset from an UndoRecPtr. */
+#define UndoRecPtrGetOffset(urp)				\
+	((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1))
+
+/* Make an UndoRecPtr from a log number and offset. */
+#define MakeUndoRecPtr(logno, offset)			\
+	(((uint64) (logno) << UndoLogOffsetBits) | (offset))
+
+/* The number of unusable bytes in the header of each block. */
+#define UndoLogBlockHeaderSize SizeOfPageHeaderData
+
+/* The number of usable bytes we can store per block. */
+#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize)
+
+/* The pseudo-database OID used for undo logs. */
+#define UndoLogDatabaseOid 9
+
+/* Length of undo checkpoint filename */
+#define UNDO_CHECKPOINT_FILENAME_LENGTH	16
+
+/*
+ * UndoRecPtrIsValid
+ *		True iff undoRecPtr is valid.
+ */
+#define UndoRecPtrIsValid(undoRecPtr) \
+	((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr))
+
+/* Extract the relnode for an undo log. */
+#define UndoRecPtrGetRelNode(urp)				\
+	UndoRecPtrGetLogNo(urp)
+
+/* The only valid fork number for undo log buffers. */
+#define UndoLogForkNum MAIN_FORKNUM
+
+/* Compute the block number that holds a given UndoRecPtr. */
+#define UndoRecPtrGetBlockNum(urp)				\
+	(UndoRecPtrGetOffset(urp) / BLCKSZ)
+
+/* Compute the offset of a given UndoRecPtr in the page that holds it. */
+#define UndoRecPtrGetPageOffset(urp)			\
+	(UndoRecPtrGetOffset(urp) % BLCKSZ)
+
+/* Compare two undo checkpoint files to find the oldest file. */
+#define UndoCheckPointFilenamePrecedes(file1, file2)	\
+	(strcmp(file1, file2) < 0)
+
+/* What is the offset of the i'th non-header byte? */
+#define UndoLogOffsetFromUsableByteNo(i)								\
+	(((i) / UndoLogUsableBytesPerPage) * BLCKSZ +						\
+	 UndoLogBlockHeaderSize +											\
+	 ((i) % UndoLogUsableBytesPerPage))
+
+/* How many non-header bytes are there before a given offset? */
+#define UndoLogOffsetToUsableByteNo(offset)				\
+	(((offset) % BLCKSZ - UndoLogBlockHeaderSize) +		\
+	 ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage)
+
+/* Add 'n' usable bytes to offset stepping over headers to find new offset. */
+#define UndoLogOffsetPlusUsableBytes(offset, n)							\
+	UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n))
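+
+/*
+ * For example, assuming the standard BLCKSZ of 8192 and a 24-byte page
+ * header, UndoLogOffsetPlusUsableBytes(8190, 10) steps over the next block's
+ * header and yields offset 8224, i.e. 8 usable bytes into the second block.
+ */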
+
+/* Populate a RelFileNode from an UndoRecPtr. */
+#define UndoRecPtrAssignRelFileNode(rfn, urp)			\
+	do													\
+	{													\
+		(rfn).spcNode = UndoRecPtrGetTablespace(urp);	\
+		(rfn).dbNode = UndoLogDatabaseOid;				\
+		(rfn).relNode = UndoRecPtrGetRelNode(urp);		\
+	} while (false);
+
+/*
+ * Control metadata for an active undo log.  Lives in shared memory inside an
+ * UndoLogControl object, but also written to disk during checkpoints.
+ */
+typedef struct UndoLogMetaData
+{
+	UndoLogNumber logno;
+	UndoLogStatus status;
+	Oid		tablespace;
+	UndoPersistence persistence;	/* permanent, unlogged, temp? */
+	UndoLogOffset insert;			/* next insertion point (head) */
+	UndoLogOffset end;				/* one past end of highest segment */
+	UndoLogOffset discard;			/* oldest data needed (tail) */
+	UndoLogOffset last_xact_start;	/* last transaction's start undo offset */
+
+	bool	is_first_rec;
+
+	/*
+	 * Length of the most recently inserted undo record.  We save this in the
+	 * undo meta-data and WAL so that the value survives a restart, because
+	 * the first undo record written after a restart needs it to find the
+	 * previous record of the same transaction during rollback.  Without it,
+	 * a transaction that wrote some undo before a checkpoint and more after
+	 * it could not be rolled back properly.  The undo worker also fetches
+	 * this value when rolling back the last transaction in the undo log, to
+	 * locate that transaction's last undo record.
+	 */
+	uint16	prevlen;
+} UndoLogMetaData;
+
+#ifndef FRONTEND
+
+/*
+ * The in-memory control object for an undo log.  We have a fixed-sized array
+ * of these.
+ */
+typedef struct UndoLogControl
+{
+	/*
+	 * Protected by UndoLogLock and 'mutex'.  Both must be held to steal this
+	 * slot for another undolog.  Either may be held to prevent that from
+	 * happening.
+	 */
+	UndoLogNumber logno;			/* InvalidUndoLogNumber for unused slots */
+
+	/* Protected by UndoLogLock. */
+	UndoLogNumber next_free;		/* link for active unattached undo logs */
+
+	/* Protected by 'mutex'. */
+	LWLock	mutex;
+	UndoLogMetaData meta;			/* current meta-data */
+	XLogRecPtr      lsn;
+	bool	need_attach_wal_record;	/* should next insertion WAL-log an attach record? */
+	pid_t		pid;				/* InvalidPid for unattached */
+	TransactionId xid;
+
+	/* Protected by 'discard_lock'.  State used by undo workers. */
+	LWLock		discard_lock;		/* prevents discarding while reading */
+	TransactionId	oldest_xid;		/* cache of oldest transaction's xid */
+	uint32		oldest_xidepoch;
+	UndoRecPtr	oldest_data;
+
+} UndoLogControl;
+
+extern UndoLogControl *UndoLogGet(UndoLogNumber logno, bool missing_ok);
+extern UndoLogControl *UndoLogNext(UndoLogControl *log);
+extern bool AmAttachedToUndoLog(UndoLogControl *log);
+extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full);
+
+/*
+ * Each backend maintains a small hash table mapping undo log numbers to
+ * UndoLogControl objects in shared memory.
+ *
+ * We also cache the tablespace here, since we need fast access to that when
+ * resolving UndoRecPtr to a buffer tag.  We could also reach that via
+ * control->meta.tablespace, but that can't be accessed without locking (since
+ * the UndoLogControl object might be recycled).  Since the tablespace for a
+ * given undo log is constant for the whole life of the undo log, there is no
+ * invalidation problem to worry about.
+ */
+typedef struct UndoLogTableEntry
+{
+	UndoLogNumber	number;
+	UndoLogControl *control;
+	Oid				tablespace;
+	char			status;
+} UndoLogTableEntry;
+
+/*
+ * Instantiate fast inline hash table access functions.  We use an identity
+ * hash function for speed, since we already have integers and don't expect
+ * many collisions.
+ */
+#define SH_PREFIX undologtable
+#define SH_ELEMENT_TYPE UndoLogTableEntry
+#define SH_KEY_TYPE UndoLogNumber
+#define SH_KEY number
+#define SH_HASH_KEY(tb, key) (key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+extern PGDLLIMPORT undologtable_hash *undologtable_cache;
+
+/*
+ * Find the OID of the tablespace that holds a given UndoRecPtr.  This is
+ * included in the header so it can be inlined by UndoRecPtrAssignRelFileNode.
+ */
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+	UndoLogNumber		logno = UndoRecPtrGetLogNo(urp);
+	UndoLogTableEntry  *entry;
+
+	/*
+	 * Fast path, for undo logs we've seen before.  This is safe because
+	 * tablespaces are constant for the lifetime of an undo log number.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		return entry->tablespace;
+
+	/*
+	 * Slow path: force cache entry to be created.  Raises an error if the
+	 * undo log has been entirely discarded, or hasn't been created yet.  That
+	 * is appropriate here, because this interface is designed for accessing
+	 * undo pages via bufmgr, and we should never be trying to access undo
+	 * pages that have been discarded.
+	 */
+	UndoLogGet(logno, false);
+
+	/*
+	 * We use the value from the newly created cache entry, because it's
+	 * cheaper than acquiring log->mutex and reading log->meta.tablespace.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	return entry->tablespace;
+}
+#endif
+
+/* Space management. */
+extern UndoRecPtr UndoLogAllocate(size_t size,
+								  UndoPersistence level);
+extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid,
+											size_t size,
+											UndoPersistence persistence);
+extern void UndoLogAdvance(UndoRecPtr insertion_point,
+						   size_t size,
+						   UndoPersistence persistence);
+extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid);
+extern bool UndoLogIsDiscarded(UndoRecPtr point);
+
+/* Initialization interfaces. */
+extern void StartupUndoLogs(XLogRecPtr checkPointRedo);
+extern void UndoLogShmemInit(void);
+extern Size UndoLogShmemSize(void);
+extern void UndoLogInit(void);
+extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace,
+							   char *path);
+extern void ResetUndoLogs(UndoPersistence persistence);
+
+/* Interface used by tablespace.c. */
+extern bool DropUndoLogsInTablespace(Oid tablespace);
+
+/* GUC interfaces. */
+extern void assign_undo_tablespaces(const char *newval, void *extra);
+
+/* Checkpointing interfaces. */
+extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo,
+							   XLogRecPtr priorCheckPointRedo);
+
+extern void UndoLogSetLastXactStartPoint(UndoRecPtr point);
+extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno);
+extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno,
+										  TransactionId xid);
+extern UndoRecPtr UndoLogGetLastRecordPtr(UndoLogNumber,
+										  TransactionId xid);
+extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen);
+extern bool IsTransactionFirstRec(TransactionId xid);
+extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen);
+extern uint16 UndoLogGetPrevLen(UndoLogNumber logno);
+extern void UndoLogSetLSN(XLogRecPtr lsn);
+void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);
+/* Discard the undo logs for temp tables */
+extern void TempUndoDiscard(UndoLogNumber);
+extern UndoRecPtr UndoLogStateGetAndClearPrevLogXactUrp(void);
+extern UndoLogNumber UndoLogAmAttachedTo(UndoPersistence persistence);
+extern Oid UndoLogStateGetDatabaseId(void);
+
+/* Test-only interfaces. */
+extern void UndoLogDetachFull(void);
+
+#endif
diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h
new file mode 100644
index 0000000..34a622e
--- /dev/null
+++ b/src/include/access/undolog_xlog.h
@@ -0,0 +1,73 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog_xlog.h
+ *	  undo log access XLOG definitions.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_XLOG_H
+#define UNDOLOG_XLOG_H
+
+#include "access/undolog.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* XLOG records */
+#define XLOG_UNDOLOG_CREATE		0x00
+#define XLOG_UNDOLOG_EXTEND		0x10
+#define XLOG_UNDOLOG_ATTACH		0x20
+#define XLOG_UNDOLOG_DISCARD	0x30
+#define XLOG_UNDOLOG_REWIND		0x40
+#define XLOG_UNDOLOG_META		0x50
+#define XLOG_UNDOLOG_SWITCH		0x60
+
+/* Create a new undo log. */
+typedef struct xl_undolog_create
+{
+	UndoLogNumber logno;
+	Oid		tablespace;
+	UndoPersistence persistence;
+} xl_undolog_create;
+
+/* Extend an undo log by adding a new segment. */
+typedef struct xl_undolog_extend
+{
+	UndoLogNumber logno;
+	UndoLogOffset end;
+} xl_undolog_extend;
+
+/* Record the undo log number used for a transaction. */
+typedef struct xl_undolog_attach
+{
+	TransactionId xid;
+	UndoLogNumber logno;
+	Oid				dbid;
+} xl_undolog_attach;
+
+/* Discard space, and possibly destroy or recycle undo log segments. */
+typedef struct xl_undolog_discard
+{
+	UndoLogNumber logno;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	TransactionId latestxid;	/* latest xid whose undo records are discarded */
+	bool		  entirely_discarded;
+} xl_undolog_discard;
+
+/* Rewind insert location of the undo log. */
+typedef struct xl_undolog_rewind
+{
+	UndoLogNumber logno;
+	UndoLogOffset insert;
+	uint16		  prevlen;
+} xl_undolog_rewind;
+
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f79fcfe..e464091 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10048,4 +10048,11 @@
   proargnames => '{rootrelid,relid,parentrelid,isleaf,level}',
   prosrc => 'pg_partition_tree' },
 
+# undo logs
+{ oid => '5032', descr => 'list undo logs',
+  proname => 'pg_stat_get_undo_logs', procost => '1', prorows => '10', proretset => 't',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,text,text,text,text,text,xid,int4,oid,text}', proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{logno,persistence,tablespace,discard,insert,end,xid,pid,prev_logno,status}', prosrc => 'pg_stat_get_undo_logs' },
+
 ]
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index b2dcb73..4305af6 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,8 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SHARED_TUPLESTORE,
 	LWTRANCHE_TBM,
 	LWTRANCHE_PARALLEL_APPEND,
+	LWTRANCHE_UNDOLOG,
+	LWTRANCHE_UNDODISCARD,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index 64457c7..8b30828 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -426,6 +426,8 @@ extern void GUC_check_errcode(int sqlerrcode);
 extern bool check_default_tablespace(char **newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra, GucSource source);
 extern void assign_temp_tablespaces(const char *newval, void *extra);
+extern bool check_undo_tablespaces(char **newval, void **extra, GucSource source);
+extern void assign_undo_tablespaces(const char *newval, void *extra);
 
 /* in catalog/namespace.c */
 extern bool check_search_path(char **newval, void **extra, GucSource source);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index b68b8d2..740431f 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1919,6 +1919,17 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_undo_logs| SELECT pg_stat_get_undo_logs.logno,
+    pg_stat_get_undo_logs.persistence,
+    pg_stat_get_undo_logs.tablespace,
+    pg_stat_get_undo_logs.discard,
+    pg_stat_get_undo_logs.insert,
+    pg_stat_get_undo_logs."end",
+    pg_stat_get_undo_logs.xid,
+    pg_stat_get_undo_logs.pid,
+    pg_stat_get_undo_logs.prev_logno,
+    pg_stat_get_undo_logs.status
+   FROM pg_stat_get_undo_logs() pg_stat_get_undo_logs(logno, persistence, tablespace, discard, insert, "end", xid, pid, prev_logno, status);
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
-- 
1.8.3.1

0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v4.patchapplication/octet-stream; name=0002-Provide-access-to-undo-log-data-via-the-buffer-manag_v4.patchDownload
From 8751db6e46166a3c54c456c6d981c74cdcd526e3 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Mon, 3 Dec 2018 10:12:34 +0530
Subject: [PATCH 2/3] Provide access to undo log data via the buffer manager.

In ancient Berkeley POSTGRES, smgr.c allowed for different storage engines, of
which only md.c survives.  Revive this mechanism to provide access to undo log
data through the existing buffer manager.

Undo logs exist in a pseudo-database whose OID is used to dispatch IO requests
to undofile.c instead of md.c.

Note: a separate proposal generalizes the fsync request machinery, see
https://commitfest.postgresql.org/20/1829/.  This patch has some stand-in
fsync machinery, but will be rebased on that other one depending on progress.
It seems better to avoid tangling up too many concurrent proposals, so for
now this patch has its own fsync queue, duplicating some code from md.c.

Author: Thomas Munro, though ForgetBuffer() was contributed by Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  10 +-
 src/backend/postmaster/checkpointer.c  |   2 +-
 src/backend/postmaster/pgstat.c        |  24 +-
 src/backend/storage/buffer/bufmgr.c    |  82 ++++-
 src/backend/storage/smgr/Makefile      |   2 +-
 src/backend/storage/smgr/md.c          |  15 +-
 src/backend/storage/smgr/smgr.c        |  49 ++-
 src/backend/storage/smgr/undofile.c    | 546 +++++++++++++++++++++++++++++++++
 src/include/pgstat.h                   |  16 +-
 src/include/storage/bufmgr.h           |  14 +-
 src/include/storage/smgr.h             |  35 ++-
 src/include/storage/undofile.h         |  50 +++
 12 files changed, 810 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/storage/smgr/undofile.c
 create mode 100644 src/include/storage/undofile.h

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 4ecdc92..8fed7b1 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -346,7 +346,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * Make sure that if the block is marked with WILL_INIT, the caller is
 	 * going to initialize it. And vice versa.
 	 */
-	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+	zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
@@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index b9c118e..b2505c8 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1314,7 +1314,7 @@ AbsorbFsyncRequests(void)
 	LWLockRelease(CheckpointerCommLock);
 
 	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+		smgrrequestsync(request->rnode, request->forknum, request->segno);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 8676088..9d717d9 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3515,7 +3515,7 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
-			/* no default case, so that compiler will warn */
+		/* no default case, so that compiler will warn */
 	}
 
 	return event_name;
@@ -3897,6 +3897,28 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_READ:
+			event_name = "UndoCheckpointRead";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_WRITE:
+			event_name = "UndoCheckpointWrite";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_SYNC:
+			event_name = "UndoCheckpointSync";
+			break;
+		case WAIT_EVENT_UNDO_FILE_READ:
+			event_name = "UndoFileRead";
+			break;
+		case WAIT_EVENT_UNDO_FILE_WRITE:
+			event_name = "UndoFileWrite";
+			break;
+		case WAIT_EVENT_UNDO_FILE_FLUSH:
+			event_name = "UndoFileFlush";
+			break;
+		case WAIT_EVENT_UNDO_FILE_SYNC:
+			event_name = "UndoFileSync";
+			break;
+
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 9817770..bf2408a 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
 static inline int32 GetPrivateRefCount(Buffer buffer);
 static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void InvalidateBuffer(BufferDesc *buf);
 
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * valid, the page is zeroed instead of throwing an error. This is intended
  * for non-critical data, where the caller is prepared to repair errors.
  *
- * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
+ * In RBM_ZERO mode, if the page isn't in buffer cache already, it's
  * filled with zeros instead of reading it from disk.  Useful when the caller
  * is going to fill the page from scratch, since this saves I/O and avoids
  * unnecessary failure if the page-on-disk has corrupt page headers.
+ *
+ * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked.
  * The page is returned locked to ensure that the caller has a chance to
  * initialize the page before it's made visible to others.
  * Caution: do not use this mode to read a page that is beyond the relation's
@@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy,
+						  char relpersistence)
 {
 	bool		hit;
 
-	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
-
-	Assert(InRecovery);
+	SMgrRelation smgr = smgropen(rnode,
+								 relpersistence == RELPERSISTENCE_TEMP
+								 ? MyBackendId : InvalidBackendId);
 
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -883,7 +882,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Read in the page, unless the caller intends to overwrite it and
 		 * just wants us to allocate a buffer.
 		 */
-		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+		if (mode == RBM_ZERO ||
+			mode == RBM_ZERO_AND_LOCK ||
+			mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
@@ -1338,6 +1339,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 }
 
 /*
+ * ForgetBuffer -- drop a buffer from shared buffers
+ *
+ * If the buffer isn't present in shared buffers, nothing happens.  If it is
+ * present, it is discarded without making any attempt to write it back out to
+ * the operating system.  The caller must therefore somehow be sure that the
+ * data won't be needed for anything now or in the future.  It assumes that
+ * there is no concurrent access to the block, except that it might be being
+ * concurrently written.
+ */
+void
+ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+{
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+	BufferTag	tag;			/* identity of target block */
+	uint32		hash;			/* hash value for tag */
+	LWLock	   *partitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	hash = BufTableHashCode(&tag);
+	partitionLock = BufMappingPartitionLock(hash);
+
+	/* see if the block is in the buffer pool */
+	LWLockAcquire(partitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&tag, hash);
+	LWLockRelease(partitionLock);
+
+	/* didn't find it, so nothing to do */
+	if (buf_id < 0)
+		return;
+
+	/* take the buffer header lock */
+	bufHdr = GetBufferDescriptor(buf_id);
+	buf_state = LockBufHdr(bufHdr);
+
+	/*
+	 * The buffer might have been evicted after we released the partition lock
+	 * before we acquired the buffer header lock.  If so, the buffer we've
+	 * locked might contain some other data which we shouldn't touch. If the
+	 * buffer hasn't been recycled, we proceed to invalidate it.
+	 */
+	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		bufHdr->tag.blockNum == blockNum &&
+		bufHdr->tag.forkNum == forkNum)
+		InvalidateBuffer(bufHdr);		/* releases spinlock */
+	else
+		UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
  * InvalidateBuffer -- mark a shared buffer invalid and return it to the
  * freelist.
  *
@@ -1412,7 +1468,7 @@ retry:
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
 		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
-			elog(ERROR, "buffer is pinned in InvalidateBuffer");
+			elog(PANIC, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
 	}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0..b657eb2 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrtype.o undofile.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index 4c6a505..4c489a2 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -45,7 +45,7 @@
 #define UNLINKS_PER_ABSORB		10
 
 /*
- * Special values for the segno arg to RememberFsyncRequest.
+ * Special values for the segno arg to mdrequestsync.
  *
  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
@@ -1420,7 +1420,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
 	}
 	else
 	{
@@ -1456,8 +1456,7 @@ register_unlink(RelFileNodeBackend rnode)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+		mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST);
 	}
 	else
 	{
@@ -1476,7 +1475,7 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ * mdrequestsync() -- callback from checkpointer side of fsync request
  *
  * We stuff fsync requests into the local hash table for execution
  * during the checkpointer's next checkpoint.  UNLINK requests go into a
@@ -1497,7 +1496,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
 {
 	Assert(pendingOpsTable);
 
@@ -1640,7 +1639,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
@@ -1679,7 +1678,7 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 189342e..d0b2c0d 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,6 +58,8 @@ typedef struct f_smgr
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
+	void		(*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum,
+									 int segno);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);	/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
@@ -81,15 +83,45 @@ static const f_smgr smgrsw[] = {
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
+		.smgr_requestsync = mdrequestsync,
 		.smgr_immedsync = mdimmedsync,
 		.smgr_pre_ckpt = mdpreckpt,
 		.smgr_sync = mdsync,
 		.smgr_post_ckpt = mdpostckpt
+	},
+	/* undo logs */
+	{
+		.smgr_init = undofile_init,
+		.smgr_shutdown = undofile_shutdown,
+		.smgr_close = undofile_close,
+		.smgr_create = undofile_create,
+		.smgr_exists = undofile_exists,
+		.smgr_unlink = undofile_unlink,
+		.smgr_extend = undofile_extend,
+		.smgr_prefetch = undofile_prefetch,
+		.smgr_read = undofile_read,
+		.smgr_write = undofile_write,
+		.smgr_writeback = undofile_writeback,
+		.smgr_nblocks = undofile_nblocks,
+		.smgr_truncate = undofile_truncate,
+		.smgr_requestsync = undofile_requestsync,
+		.smgr_immedsync = undofile_immedsync,
+		.smgr_pre_ckpt = undofile_preckpt,
+		.smgr_sync = undofile_sync,
+		.smgr_post_ckpt = undofile_postckpt
 	}
 };
 
 static const int NSmgr = lengthof(smgrsw);
 
+/*
+ * In ancient Postgres the catalog entry for each relation controlled the
+ * choice of storage manager implementation.  Now we have only md.c for
+ * regular relations, and undofile.c for undo log storage in the undolog
+ * pseudo-database.
+ */
+#define SmgrWhichForRelFileNode(rfn)			\
+	((rfn).dbNode == 9 ? 1 : 0)
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -185,11 +217,18 @@ smgropen(RelFileNode rnode, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		reln->smgr_fsm_nblocks = InvalidBlockNumber;
 		reln->smgr_vm_nblocks = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		/* Which storage manager implementation? */
+		reln->smgr_which = SmgrWhichForRelFileNode(rnode);
 
 		/* mark it not open */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
 			reln->md_num_open_segs[forknum] = 0;
+			reln->md_seg_fds[forknum] = NULL;
+		}
+
+		reln->private_data = NULL;
 
 		/* it has no owner yet */
 		add_to_unowned_list(reln);
@@ -723,6 +762,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 }
 
 /*
+ *	smgrrequestsync() -- Enqueue a request for smgrsync() to flush data.
+ */
+void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno);
+}
+
+/*
  *	smgrimmedsync() -- Force the specified relation to stable storage.
  *
  *		Synchronously force all previous writes to the specified relation
diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
new file mode 100644
index 0000000..afba64e
--- /dev/null
+++ b/src/backend/storage/smgr/undofile.c
@@ -0,0 +1,546 @@
+/*
+ * undofile.c
+ *
+ * PostgreSQL undo file manager.  This module provides an SMGR-compatible
+ * interface to the files that back undo logs on the filesystem, so that undo
+ * log data can use the shared buffer pool.  Other aspects of undo log
+ * management are provided by undolog.c, so the SMGR interfaces not directly
+ * concerned with reading, writing and flushing data are unimplemented.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/storage/smgr/undofile.c
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/fd.h"
+#include "storage/undofile.h"
+#include "utils/memutils.h"
+
+/* intervals for calling AbsorbFsyncRequests in undofile_sync */
+#define FSYNCS_PER_ABSORB		10
+
+/*
+ * Special values for the fork arg to undofile_requestsync.
+ */
+#define FORGET_UNDO_SEGMENT_FSYNC	(InvalidBlockNumber)
+
+/*
+ * While md.c expects random access and has a small number of huge
+ * segments, undofile.c manages a potentially very large number of smaller
+ * segments and has a less random access pattern.  Therefore, instead of
+ * keeping a potentially huge array of vfds we'll just keep the most
+ * recently accessed N.
+ *
+ * For now, N == 1, so we just need to hold onto one 'File' handle.
+ */
+typedef struct UndoFileState
+{
+	int		mru_segno;
+	File	mru_file;
+} UndoFileState;
+
+static MemoryContext UndoFileCxt;
+
+typedef uint16 CycleCtr;
+
+/*
+ * An entry recording the segments that need to be fsynced by undofile_sync().
+ * This is a bit simpler than md.c's version, though it could perhaps be
+ * merged into a common struct.  One difference is that we can have much
+ * larger segment numbers, so we'll adjust for that to avoid having a lot of
+ * leading zero bits.
+ */
+typedef struct
+{
+	RelFileNode rnode;
+	Bitmapset  *requests;
+	CycleCtr	cycle_ctr;
+} PendingOperationEntry;
+
+static HTAB *pendingOpsTable = NULL;
+static MemoryContext pendingOpsCxt;
+
+static CycleCtr undofile_sync_cycle_ctr = 0;
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok);
+static File undofile_get_segment_file(SMgrRelation reln, int segno);
+
+void
+undofile_init(void)
+{
+	UndoFileCxt = AllocSetContextCreate(TopMemoryContext,
+										"UndoFileSmgr",
+										ALLOCSET_DEFAULT_SIZES);
+
+	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	{
+		HASHCTL		hash_ctl;
+
+		pendingOpsCxt = AllocSetContextCreate(UndoFileCxt,
+											  "Pending ops context",
+											  ALLOCSET_DEFAULT_SIZES);
+		MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+		hash_ctl.keysize = sizeof(RelFileNode);
+		hash_ctl.entrysize = sizeof(PendingOperationEntry);
+		hash_ctl.hcxt = pendingOpsCxt;
+		pendingOpsTable = hash_create("Pending Ops Table",
+									  100L,
+									  &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+}
+
+void
+undofile_shutdown(void)
+{
+}
+
+void
+undofile_close(SMgrRelation reln, ForkNumber forknum)
+{
+}
+
+void
+undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_create is not supported");
+}
+
+bool
+undofile_exists(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_exists is not supported");
+}
+
+void
+undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_unlink is not supported");
+}
+
+void
+undofile_extend(SMgrRelation reln, ForkNumber forknum,
+				BlockNumber blocknum, char *buffer,
+				bool skipFsync)
+{
+	elog(ERROR, "undofile_extend is not supported");
+}
+
+void
+undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	elog(ERROR, "undofile_prefetch is not supported");
+}
+
+void
+undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			  char *buffer)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	nbytes = FileRead(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_READ);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+}
+
+static void
+register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file)
+{
+	/* Temp relations should never be fsync'd */
+	Assert(!SmgrIsTemp(reln));
+
+	if (pendingOpsTable)
+	{
+		/* push it into local pending-ops table */
+		undofile_requestsync(reln->smgr_rnode.node, forknum, segno);
+	}
+	else
+	{
+		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno))
+			return;				/* passed it off successfully */
+
+		ereport(DEBUG1,
+				(errmsg("could not forward fsync request because request queue is full")));
+
+		if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(file))));
+	}
+}
+
+void
+undofile_write(SMgrRelation reln, ForkNumber forknum,
+			   BlockNumber blocknum, char *buffer,
+			   bool skipFsync)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	nbytes = FileWrite(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_WRITE);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		/*
+		 * short write: unexpected, because this should be overwriting an
+		 * entirely pre-allocated segment file
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+
+	if (!skipFsync && !SmgrIsTemp(reln))
+		register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file);
+}
+
+void
+undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+				   BlockNumber blocknum, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		File	file;
+		int		nflush;
+
+		file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+
+		/* compute number of desired writes within the current segment */
+		nflush = Min(nblocks,
+					 1 + UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE));
+
+		FileWriteback(file,
+					  (blocknum % UNDOSEG_SIZE) * BLCKSZ,
+					  nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH);
+
+		nblocks -= nflush;
+		blocknum += nflush;
+	}
+}
+
+BlockNumber
+undofile_nblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_nblocks is not supported");
+	return 0;
+}
+
+void
+undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
+{
+	elog(ERROR, "undofile_truncate is not supported");
+}
+
+void
+undofile_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_immedsync is not supported");
+}
+
+void
+undofile_preckpt(void)
+{
+}
+
+void
+undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+	PendingOperationEntry *entry;
+	bool		found;
+
+	Assert(pendingOpsTable);
+
+	if (forknum == FORGET_UNDO_SEGMENT_FSYNC)
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_FIND,
+													  NULL);
+		if (entry)
+			entry->requests = bms_del_member(entry->requests, segno);
+	}
+	else
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_ENTER,
+													  &found);
+		if (!found)
+		{
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+			entry->requests = bms_make_singleton(segno);
+		}
+		else
+			entry->requests = bms_add_member(entry->requests, segno);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+void
+undofile_forgetsync(Oid logno, Oid tablespace, int segno)
+{
+	RelFileNode rnode;
+
+	rnode.dbNode = 9;
+	rnode.spcNode = tablespace;
+	rnode.relNode = logno;
+
+	if (pendingOpsTable)
+		undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno);
+	else if (IsUnderPostmaster)
+	{
+		while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno))
+			pg_usleep(10000L);
+	}
+}
+
+void
+undofile_sync(void)
+{
+	static bool undofile_sync_in_progress = false;
+
+	HASH_SEQ_STATUS hstat;
+	PendingOperationEntry *entry;
+	int			absorb_counter;
+	int			segno;
+
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	AbsorbFsyncRequests();
+
+	if (undofile_sync_in_progress)
+	{
+		/* prior try failed, so update any stale cycle_ctr values */
+		hash_seq_init(&hstat, pendingOpsTable);
+		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+	}
+
+	undofile_sync_cycle_ctr++;
+	undofile_sync_in_progress = true;
+
+	absorb_counter = FSYNCS_PER_ABSORB;
+	hash_seq_init(&hstat, pendingOpsTable);
+	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+	{
+		Bitmapset	   *requests;
+
+		/* Skip entries that arrived after this sync cycle began. */
+		if (entry->cycle_ctr == undofile_sync_cycle_ctr)
+			continue;
+
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr);
+
+		if (!enableFsync)
+			continue;
+
+		requests = entry->requests;
+		entry->requests = NULL;
+
+		segno = -1;
+		while ((segno = bms_next_member(requests, segno)) >= 0)
+		{
+			File		file;
+
+			if (!enableFsync)
+				continue;
+
+			file = undofile_open_segment_file(entry->rnode.relNode,
+											  entry->rnode.spcNode,
+											  segno, true /* missing_ok */);
+
+			/*
+			 * The file may be gone due to concurrent discard.  We'll ignore
+			 * that, but only if we find a cancel request for this segment in
+			 * the queue.
+			 *
+			 * It's also possible that we succeed in opening a segment file
+			 * that is subsequently recycled (renamed to represent a new range
+			 * of undo log), in which case we'll fsync that later file
+			 * instead.  That is rare and harmless.
+			 */
+			if (file <= 0)
+			{
+				char		name[MAXPGPATH];
+
+				/*
+				 * Put the request back into the bitset in a way that can't
+				 * fail due to memory allocation.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				/*
+				 * Check if a forgetsync request has arrived to delete that
+				 * segment.
+				 */
+				AbsorbFsyncRequests();
+				if (bms_is_member(segno, entry->requests))
+				{
+					UndoLogSegmentPath(entry->rnode.relNode,
+									   segno,
+									   entry->rnode.spcNode,
+									   name);
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\": %m", name)));
+				}
+				/* It must have been removed, so we can safely skip it. */
+				continue;
+			}
+
+			elog(LOG, "fsync()ing %s", FilePathName(file));	/* TODO: remove me */
+			if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0)
+			{
+				char		name[MAXPGPATH];
+
+				strcpy(name, FilePathName(file));
+				FileClose(file);
+
+				/*
+				 * Keep the failed requests, but merge with any new ones.  The
+				 * requirement to be able to do this without risk of failure
+				 * prevents us from using a smaller bitmap that doesn't bother
+				 * tracking leading zeros.  Perhaps another data structure
+				 * would be better.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not fsync file \"%s\": %m", name)));
+			}
+			requests = bms_del_member(requests, segno);
+			FileClose(file);
+
+			if (--absorb_counter <= 0)
+			{
+				AbsorbFsyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB;
+			}
+		}
+
+		bms_free(requests);
+	}
+
+	undofile_sync_in_progress = false;
+}
+
+void undofile_postckpt(void)
+{
+}
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok)
+{
+	File		file;
+	char		path[MAXPGPATH];
+
+	UndoLogSegmentPath(relNode, segno, spcNode, path);
+	file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+	if (file <= 0 && (!missing_ok || errno != ENOENT))
+		elog(ERROR, "cannot open undo segment file '%s': %m", path);
+
+	return file;
+}
+
+/*
+ * Get a File for a particular segment of a SMgrRelation representing an undo
+ * log.
+ */
+static File undofile_get_segment_file(SMgrRelation reln, int segno)
+{
+	UndoFileState *state;
+
+
+	/*
+	 * Create private state space on demand.
+	 *
+	 * XXX There should probably be a smgr 'open' or 'init' interface that
+	 * would do this.  smgr.c currently initializes reln->md_XXX stuff
+	 * directly...
+	 */
+	state = (UndoFileState *) reln->private_data;
+	if (unlikely(state == NULL))
+	{
+		state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState));
+		reln->private_data = state;
+	}
+
+	/* If we have a file open already, check if we need to close it. */
+	if (state->mru_file > 0 && state->mru_segno != segno)
+	{
+		/* These are not the blocks we're looking for. */
+		FileClose(state->mru_file);
+		state->mru_file = 0;
+	}
+
+	/* Check if we need to open a new file. */
+	if (state->mru_file <= 0)
+	{
+		state->mru_file =
+			undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.node.spcNode,
+									   segno, InRecovery);
+		if (InRecovery && state->mru_file <= 0)
+		{
+			/*
+			 * If in recovery, we may be trying to access a file that will
+			 * later be unlinked.  Tolerate missing files, creating a new
+			 * zero-filled file as required.
+			 */
+			UndoLogNewSegment(reln->smgr_rnode.node.relNode,
+							  reln->smgr_rnode.node.spcNode,
+							  segno);
+			state->mru_file =
+				undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+										   reln->smgr_rnode.node.spcNode,
+										   segno, false);
+			Assert(state->mru_file > 0);
+		}
+		state->mru_segno = segno;
+	}
+
+	return state->mru_file;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f1c10d1..763379e 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -624,6 +624,11 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter tuples_inserted;
 	PgStat_Counter tuples_updated;
 	PgStat_Counter tuples_deleted;
+
+	/*
+	 * Counter tuples_hot_updated stores the number of HOT updates for heap
+	 * tables and the number of in-place updates for zheap tables.
+	 */
 	PgStat_Counter tuples_hot_updated;
 
 	PgStat_Counter n_live_tuples;
@@ -743,6 +748,7 @@ typedef enum BackendState
 #define PG_WAIT_IPC					0x08000000U
 #define PG_WAIT_TIMEOUT				0x09000000U
 #define PG_WAIT_IO					0x0A000000U
+#define PG_WAIT_PAGE_TRANS_SLOT		0x0B000000U
 
 /* ----------
  * Wait Events - Activity
@@ -767,7 +773,7 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
 } WaitEventActivity;
 
 /* ----------
@@ -913,6 +919,13 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_READ,
+	WAIT_EVENT_UNDO_CHECKPOINT_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_SYNC,
+	WAIT_EVENT_UNDO_FILE_READ,
+	WAIT_EVENT_UNDO_FILE_WRITE,
+	WAIT_EVENT_UNDO_FILE_FLUSH,
+	WAIT_EVENT_UNDO_FILE_SYNC,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
@@ -1317,6 +1330,7 @@ pgstat_report_wait_end(void)
 
 extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_zheap_update(Relation rel);
 extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index 3cce390..5b13556 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -38,8 +38,9 @@ typedef enum BufferAccessStrategyType
 typedef enum
 {
 	RBM_NORMAL,					/* Normal read */
-	RBM_ZERO_AND_LOCK,			/* Don't read from disk, caller will
-								 * initialize. Also locks the page. */
+	RBM_ZERO,					/* Don't read from disk, caller will
+								 * initialize. */
+	RBM_ZERO_AND_LOCK,			/* Like RBM_ZERO, but also locks the page. */
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
@@ -171,7 +172,10 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 				   BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
-						  ReadBufferMode mode, BufferAccessStrategy strategy);
+						  ReadBufferMode mode, BufferAccessStrategy strategy,
+						  char relpersistence);
+extern void ForgetBuffer(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -228,6 +232,10 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+/* in localbuf.c */
+extern void ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum,
+				  BlockNumber blockNum);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index c843bbc..65d164b 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -71,6 +71,9 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
+	/* For use by implementations. */
+	void	   *private_data;
+
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	struct SMgrRelationData *next_unowned_reln;
 } SMgrRelationData;
@@ -105,6 +108,7 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
+extern void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
 extern void smgrsync(void);
@@ -133,14 +137,41 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
+extern void mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
+/* in undofile.c */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+		BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+			BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber nblocks);
+extern void undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
 extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/undofile.h b/src/include/storage/undofile.h
new file mode 100644
index 0000000..7544be3
--- /dev/null
+++ b/src/include/storage/undofile.h
@@ -0,0 +1,50 @@
+/*
+ * undofile.h
+ *
+ * PostgreSQL undo file manager.  This module manages the files that back undo
+ * logs on the filesystem.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/undofile.h
+ */
+
+#ifndef UNDOFILE_H
+#define UNDOFILE_H
+
+#include "storage/smgr.h"
+
+/* Prototypes of functions exposed to SMgr. */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber blocknum, char *buffer,
+							bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, char *buffer,
+						   bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber nblocks);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
+/* Functions used by undolog.c. */
+extern void undofile_forgetsync(Oid logno, Oid tablespace, int segno);
+
+#endif
-- 
1.8.3.1

#30Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#29)
1 attachment(s)
Re: Undo logs

On Sun, Dec 23, 2018 at 3:49 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Dec 12, 2018 at 3:03 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

For addressing these issues related to multiple logs I have changed the
design as we discussed offlist.
1) Now, at Do time we identify the log switch as you mentioned above
(by checking which log we are attached to before and after allocation).
And, if the log has switched we write a WAL record for it; during
recovery, whenever this WAL record is replayed, we store the undo record
pointer of the transaction header (which is in the previous undo log)
in UndoLogStateData, and read it while allocating space for the undo
record and immediately reset it.

2) For handling the discard issue, along with updating the current
transaction's start header in the previous undo log we also update the
previous transaction's start header in the current log if we are
assigned a non-empty undo log.

3) For identifying the previous undo record of the transaction during
rollback (when the undo log is switched), we store the undo record
pointer of the transaction's last record in the previous undo log in the
transaction header of the first undo record in the new undo log.

Thanks, the new changes look mostly okay to me, but I have a few comments:
1.
+ /*
+ * WAL log, for log switch.  This is required to identify the log switch
+ * during recovery.
+ */
+ if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+ {
+ XLogBeginInsert();
+ XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+ }
+

Don't we want to do this under a critical section?
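
For example, something along these lines is what I have in mind (only a
sketch, not tested; the exact placement inside UndoRecordAllocate is up
to you):

    if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
    {
        START_CRIT_SECTION();
        XLogBeginInsert();
        XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
        XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
        END_CRIT_SECTION();
    }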

2.
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+    TransactionId txid, UndoPersistence upersistence)
{
..
+ if (log_switched)
+ {
+ /*
+ * If undo log is switched then during rollback we can not go
+ * to the previous undo record of the transaction by prevlen
+ * so we store the previous undo record pointer in the
+ * transaction header.
+ */
+ log = UndoLogGet(prevlogno, false);
+ urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+    log->meta.insert -
log->meta.prevlen);
+ }
..
}

Can we have an Assert for a valid prevlogno in the above condition?
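Something as simple as this would do (just a sketch):

    Assert(prevlogno != InvalidUndoLogNumber);
    log = UndoLogGet(prevlogno, false);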

+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+ (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)

Isn't it better to define urec_next as UndoRecPtr, even though it is
technically the same as per the current code?

While replying I noticed that I haven't addressed this comment; I will
handle it in the next patch. I have to change this at a couple of places.

Okay, I think the new variable (uur_prevurp) introduced by this
version of the patch also needs to be changed in a similar way.
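
In other words, both would be declared as UndoRecPtr, e.g. (sketch only;
the two fields live in different structs):

    UndoRecPtr  urec_next;      /* urec pointer of the next transaction */

    UndoRecPtr  uur_prevurp;    /* urec pointer of the transaction's last
                                 * record in the previous log, set when the
                                 * log is switched */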

Apart from the above, I have made quite a few cosmetic changes and
modified a few comments; most notably, I have updated the comments
related to the handling of multiple logs at the beginning of the
undoinsert.c file.  Kindly include these changes in your next
patchset, if they look okay to you.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-undo-interface-v12-delta-amit.patchapplication/octet-stream; name=0003-undo-interface-v12-delta-amit.patchDownload
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
index 68e6adbfc6..14b9f3c59c 100644
--- a/src/backend/access/undo/undoinsert.c
+++ b/src/backend/access/undo/undoinsert.c
@@ -20,32 +20,39 @@
  * this is entirely maintained and used by undo record layer.   See
  * undorecord.h for detailed information about undo record header.
  *
- * Handling multi log:
+ * Multiple logs:
  *
- * It is possible that the undo record of a transaction can be spread across
- * multiple undo log.  And, we need some special handling while inserting the
- * undo for discard and rollback to work sanely.
+ * It is possible that the undo records for a transaction span across
+ * multiple undo logs.  We need some special handling while inserting them to
+ * ensure that discard and rollbacks can work sanely.
  *
- * If the undorecord goes to next log then we insert a transaction header for
- * the first record in the new log and update the transaction header with this
- * new log's location. This will allow us to connect transactions across logs
- * when the same transaction span across log (for this we keep track of the
- * previous logno in undo log meta) which is required to find the latest undo
- * record pointer of the aborted transaction for executing the undo actions
- * before discard. If the next log get processed first in that case we
- * don't need to trace back the actual start pointer of the transaction,
- * in such case we can only execute the undo actions from the current log
- * because the undo pointer in the slot will be rewound and that will be enough
- * to avoid executing same actions.  However, there is possibility that after
- * executing the undo actions the undo pointer got discarded, now in later
- * stage while processing the previous log it might try to fetch the undo
- * record in the discarded log while chasing the transaction header chain.
- * To avoid this situation we first check if the next_urec of the transaction
- * is already discarded then no need to access that and start executing from
+ * When the undo record for a transaction gets inserted in the next log, we
+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new log's location.  We also keep a back
+ * pointer to the last undo record of the previous log in the first record of
+ * the new log, so that we can traverse to the previous record during rollback.
+ * In case this is not the first record in the new log (i.e. the new log
+ * already contains some other transaction's data), we also update that
+ * transaction's next start header with this new undo record's location.  This
+ * allows us to connect a transaction's undo records across logs when the same
+ * transaction spans across logs.
+ *
+ * There is some difference in the way rollbacks work when the undo for the
+ * same transaction spans across multiple logs, depending on which log is
+ * processed first by the discard worker.  If it processes the first log,
+ * which contains the transaction's first record, then it can get the last
+ * record of that transaction even if it is in a different log, and then
+ * process all the undo records from last to first.  OTOH, if the next log
+ * gets processed first, we don't need to trace back the actual start pointer
+ * of the transaction; rather, we only execute the undo actions from the
+ * current log and avoid re-executing them next time.  There is a possibility
+ * that after executing the undo actions the undo gets discarded; later, while
+ * processing the previous log, we might try to fetch an undo record in the
+ * discarded log while chasing the transaction header chain, which can cause
+ * trouble.  We avoid this situation by first checking whether the next_urec
+ * of the transaction is already discarded and if so, we start executing from
  * the last undo record in the current log.
  *
- * We only connect to next log if the same transaction spread to next log
- * otherwise don't.
  *-------------------------------------------------------------------------
  */
 
@@ -81,12 +88,12 @@
 #define MAX_PREPARED_UNDO 2
 
 /*
- * This defines the max number of previous xact info we need to update.
+ * This defines the max number of previous xact infos we need to update.
  * Usually it's 1 for updating next link of previous transaction's header
- * if we are starting a new transaction. But, in some cases where the same
- * transaction is spilled to the next log that time we update our own
- * transaction's header in previous undo log as well as the header of the
- * previous transaction in the new log.
+ * if we are starting a new transaction.  But, in some cases where the same
+ * transaction is spilled to the next log, we update our own transaction's
+ * header in previous undo log as well as the header of the previous
+ * transaction in the new log.
  */
 #define MAX_XACT_UNDO_INFO	2
 
@@ -529,9 +536,10 @@ resize:
 	if (InRecovery)
 	{
 		/*
-		 * During recovery we can directly identify by checking the prevlogurp
-		 * from the MyUndoLogState which is stored in it by WAL and we
-		 * immediately reset it.
+		 * During recovery we can identify the log switch by checking the
+		 * prevlogurp from the MyUndoLogState.  The WAL replay action for log
+		 * switch would have set the value and we need to clear it after
+		 * retrieving the latest value.
 		 */
 		prevlogurp = UndoLogStateGetAndClearPrevLogXactUrp();
 		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
@@ -544,8 +552,9 @@ resize:
 	else
 	{
 		/*
-		 * Just check the current log which we are attached to, and if this
-		 * got switched after the allocation then the undo log got switched.
+		 * Check whether the current log is switched after allocation.  We can
+		 * determine that by simply checking to which log we are attached
+		 * before and after allocation.
 		 */
 		prevlogno = UndoLogAmAttachedTo(upersistence);
 		urecptr = UndoLogAllocate(size, upersistence);
@@ -608,7 +617,7 @@ resize:
 	UndoLogAdvance(urecptr, size, upersistence);
 
 	/*
-	 * WAL log, for log switch.  This is required to identify the log switch
+	 * Write WAL for log switch.  This is required to identify the log switch
 	 * during recovery.
 	 */
 	if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
@@ -883,13 +892,13 @@ InsertPreparedUndo(void)
 		SetCurrentUndoLocation(urp);
 	}
 
-	/* Update previous transaction header. */
+	/* Update previously prepared transaction headers. */
 	if (xact_urec_info_idx > 0)
 	{
-		int			i = 0;
+		int		i = 0;
 
 		for (i = 0; i < xact_urec_info_idx; i++)
-			UndoRecordUpdateTransInfo(i);
+			 UndoRecordUpdateTransInfo(i);
 	}
 
 }
@@ -1106,10 +1115,8 @@ UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
 /*
  * Return the previous undo record pointer.
  *
- * If prevurp is valid undo record pointer then it will directly
- * return that assuming the caller has detected the undo log got
- * switched during the transaction and prevurp is a valid previous
- * undo record pointer of the transaction in the previous undo log.
+ * A valid value of prevurp indicates that the previous undo record
+ * pointer is in some other log and caller can directly use that.
  * Otherwise this will calculate the previous undo record pointer
  * by using current urp and the prevlen.
  */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
index 9f090558f9..0b29334652 100644
--- a/src/include/access/undorecord.h
+++ b/src/include/access/undorecord.h
@@ -117,10 +117,10 @@ typedef struct UndoRecordTransaction
 	Oid			urec_dbid;		/* database id */
 
 	/*
-	 * Transaction previous undo record pointer when transaction split across
-	 * undo log.  The first undo record in the new log will stores the
-	 * previous undo record pointer in the previous log as we can not
-	 * calculate that directly using prevlen during rollback.
+	 * Transaction's previous undo record pointer when a transaction spans
+	 * across undo logs.  The first undo record in the new log stores the
+	 * previous undo record pointer in the previous log as we can't calculate
+	 * that directly using prevlen during rollback.
 	 */
 	uint64		urec_prevurp;
 	uint64		urec_next;		/* urec pointer of the next transaction */
@@ -175,7 +175,8 @@ typedef struct UnpackedUndoRecord
 	OffsetNumber uur_offset;	/* offset number */
 	Buffer		uur_buffer;		/* buffer in which undo record data points */
 	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
-	uint64		uur_prevurp;
+	uint64		uur_prevurp;	/* urec pointer to the previous record in
+								 * the different log */
 	uint64		uur_next;		/* urec pointer of the next transaction */
 	Oid			uur_dbid;		/* database id */
 
#31 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#30)
3 attachment(s)
Re: Undo logs

On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks, the new changes look mostly okay to me, but I have a few comments:
1.
+ /*
+ * WAL log, for log switch.  This is required to identify the log switch
+ * during recovery.
+ */
+ if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+ {
+ XLogBeginInsert();
+ XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+ }
+

Don't we want to do this under critical section?

I think we are not making any buffer changes here, just inserting a
WAL record, so IMHO we don't need a critical section. Am I missing
something?
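
For readers following the thread, here is a rough sketch (illustrative only,
not taken from the patch) of the convention being discussed: a critical
section is needed when a shared buffer is modified together with its WAL
record, so that an error in between cannot leave a dirty page whose WAL was
never written.  A standalone informational record such as XLOG_UNDOLOG_SWITCH
registers no buffer, so there is nothing of that kind to protect.

#include "postgres.h"

#include "access/rmgr.h"
#include "access/xloginsert.h"
#include "miscadmin.h"
#include "storage/bufmgr.h"
#include "storage/bufpage.h"

/*
 * Illustrative sketch only: the usual pattern when a page change and its WAL
 * record must be kept together.  Compare with the standalone
 * XLOG_UNDOLOG_SWITCH insert quoted above, which modifies no buffer and so
 * runs outside any critical section.
 */
static void
emit_buffer_change_with_wal(Buffer buffer, RmgrId rmid, uint8 info)
{
	XLogRecPtr	recptr;

	START_CRIT_SECTION();
	/* ... modify the page held in 'buffer' here ... */
	MarkBufferDirty(buffer);	/* page change and WAL must succeed together */
	XLogBeginInsert();
	XLogRegisterBuffer(0, buffer, REGBUF_STANDARD);
	recptr = XLogInsert(rmid, info);
	PageSetLSN(BufferGetPage(buffer), recptr);
	END_CRIT_SECTION();
}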

2.
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+    TransactionId txid, UndoPersistence upersistence)
{
..
+ if (log_switched)
+ {
+ /*
+ * If undo log is switched then during rollback we can not go
+ * to the previous undo record of the transaction by prevlen
+ * so we store the previous undo record pointer in the
+ * transaction header.
+ */
+ log = UndoLogGet(prevlogno, false);
+ urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+    log->meta.insert -
log->meta.prevlen);
+ }
..
}

Can we have an Assert for a valid prevlogno in the above condition?

Done
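
(For context, the requested assertion would look roughly like this in the
quoted block; the sentinel name is assumed here, so substitute whatever
undolog.h actually defines for an invalid log number.)

	if (log_switched)
	{
		/*
		 * prevlogno came from UndoLogAmAttachedTo() before the allocation, so
		 * it must refer to a real log whenever a switch has been detected.
		 * (InvalidUndoLogNumber is an assumed name.)
		 */
		Assert(prevlogno != InvalidUndoLogNumber);

		/*
		 * If the undo log switched, rollback cannot reach the previous undo
		 * record of the transaction via prevlen, so store the previous undo
		 * record pointer in the transaction header.
		 */
		log = UndoLogGet(prevlogno, false);
		urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
										   log->meta.insert - log->meta.prevlen);
	}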

+ uint64 urec_next; /* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+ (offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)

Isn't it better to define urec_next as UndoRecPtr, even though it is
technically the same as per the current code?

While replying I noticed that I hadn't addressed this comment; I will
handle it in the next patch. I have to change this in a couple of places.

Okay, I think the new variable (uur_prevurp) introduced by this
version of the patch also needs to be changed in a similar way.

Done.
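
Concretely, the suggestion amounts to something like the following sketch
(field list abbreviated to what the quoted hunk shows); the on-disk
representation is unchanged, only the declared type becomes UndoRecPtr so the
fields' meaning is explicit:

typedef struct UndoRecordTransaction
{
	/* ... preceding fields as in undorecord.h ... */
	Oid			urec_dbid;		/* database id */
	UndoRecPtr	urec_prevurp;	/* urec pointer to the previous record of
								 * this transaction in a different log */
	UndoRecPtr	urec_next;		/* urec pointer of the next transaction */
} UndoRecordTransaction;

#define SizeOfUrecNext (sizeof(UndoRecPtr))
#define SizeOfUndoRecordTransaction \
	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)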

Apart from the above, I have made quite a few cosmetic changes and
modified a few comments; most notably, I have updated the comments
related to the handling of multiple logs at the beginning of the
undoinsert.c file. Kindly include these changes in your next
patchset, if they look okay to you.

I have taken all the changes except this one:

if (xact_urec_info_idx > 0)
  {
- int i = 0;
+ int i = 0;   --> pgindent changed it back to the above one.
  for (i = 0; i < xact_urec_info_idx; i++)
- UndoRecordUpdateTransInfo(i);
+ UndoRecordUpdateTransInfo(i);  -- This introduces extra space, so I ignored it
  }

The undo-log patches needed a rebase, so I have done that as well, along
with the changes mentioned above.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0002-Provide-access-to-undo-log-data-via-the-buffer-manag.patch (application/octet-stream)
From 4c68538a95a2479e60a69ea29d633d6ada43c5de Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 5 Jan 2019 10:42:02 +0530
Subject: [PATCH 2/3] Provide access to undo log data via the buffer manager.

In ancient Berkeley POSTGRES, smgr.c allowed for different storage engines, of
which only md.c survives.  Revive this mechanism to provide access to undo log
data through the existing buffer manager.

Undo logs exist in a pseudo-database whose OID is used to dispatch IO requests
to undofile.c instead of md.c.

Note: a separate proposal generalizes the fsync request machinery, see
https://commitfest.postgresql.org/20/1829/.  This patch has some stand-in
fsync machinery, but will be rebased on that other one depending on progress.
It seems better to avoid tangling up too many concurrently proposals so for
now this patch has its own fsync queue, duplicating some code from md.c.

Author: Thomas Munro, though ForgetBuffer() was contributed by Robert Haas
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/transam/xlogutils.c |  10 +-
 src/backend/postmaster/checkpointer.c  |   2 +-
 src/backend/postmaster/pgstat.c        |  24 +-
 src/backend/storage/buffer/bufmgr.c    |  82 ++++-
 src/backend/storage/smgr/Makefile      |   2 +-
 src/backend/storage/smgr/md.c          |  15 +-
 src/backend/storage/smgr/smgr.c        |  49 ++-
 src/backend/storage/smgr/undofile.c    | 546 +++++++++++++++++++++++++++++++++
 src/include/pgstat.h                   |  16 +-
 src/include/storage/bufmgr.h           |  14 +-
 src/include/storage/smgr.h             |  35 ++-
 src/include/storage/undofile.h         |  50 +++
 12 files changed, 810 insertions(+), 35 deletions(-)
 create mode 100644 src/backend/storage/smgr/undofile.c
 create mode 100644 src/include/storage/undofile.h

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 10a663b..217d092 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -346,7 +346,7 @@ XLogReadBufferForRedoExtended(XLogReaderState *record,
 	 * Make sure that if the block is marked with WILL_INIT, the caller is
 	 * going to initialize it. And vice versa.
 	 */
-	zeromode = (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
+	zeromode = (mode == RBM_ZERO || mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK);
 	willinit = (record->blocks[block_id].flags & BKPBLOCK_WILL_INIT) != 0;
 	if (willinit && !zeromode)
 		elog(PANIC, "block with WILL_INIT flag in WAL record must be zeroed by redo routine");
@@ -462,7 +462,7 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 	{
 		/* page exists in file */
 		buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-										   mode, NULL);
+										   mode, NULL, RELPERSISTENCE_PERMANENT);
 	}
 	else
 	{
@@ -487,7 +487,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				ReleaseBuffer(buffer);
 			}
 			buffer = ReadBufferWithoutRelcache(rnode, forknum,
-											   P_NEW, mode, NULL);
+											   P_NEW, mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 		while (BufferGetBlockNumber(buffer) < blkno);
 		/* Handle the corner case that P_NEW returns non-consecutive pages */
@@ -497,7 +498,8 @@ XLogReadBufferExtended(RelFileNode rnode, ForkNumber forknum,
 				LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			buffer = ReadBufferWithoutRelcache(rnode, forknum, blkno,
-											   mode, NULL);
+											   mode, NULL,
+											   RELPERSISTENCE_PERMANENT);
 		}
 	}
 
diff --git a/src/backend/postmaster/checkpointer.c b/src/backend/postmaster/checkpointer.c
index fe96c41..b558cf9 100644
--- a/src/backend/postmaster/checkpointer.c
+++ b/src/backend/postmaster/checkpointer.c
@@ -1314,7 +1314,7 @@ AbsorbFsyncRequests(void)
 	LWLockRelease(CheckpointerCommLock);
 
 	for (request = requests; n > 0; request++, n--)
-		RememberFsyncRequest(request->rnode, request->forknum, request->segno);
+		smgrrequestsync(request->rnode, request->forknum, request->segno);
 
 	END_CRIT_SECTION();
 
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index d503515..03591bf 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -3515,7 +3515,7 @@ pgstat_get_wait_activity(WaitEventActivity w)
 		case WAIT_EVENT_WAL_WRITER_MAIN:
 			event_name = "WalWriterMain";
 			break;
-			/* no default case, so that compiler will warn */
+		/* no default case, so that compiler will warn */
 	}
 
 	return event_name;
@@ -3897,6 +3897,28 @@ pgstat_get_wait_io(WaitEventIO w)
 		case WAIT_EVENT_TWOPHASE_FILE_WRITE:
 			event_name = "TwophaseFileWrite";
 			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_READ:
+			event_name = "UndoCheckpointRead";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_WRITE:
+			event_name = "UndoCheckpointWrite";
+			break;
+		case WAIT_EVENT_UNDO_CHECKPOINT_SYNC:
+			event_name = "UndoCheckpointSync";
+			break;
+		case WAIT_EVENT_UNDO_FILE_READ:
+			event_name = "UndoFileRead";
+			break;
+		case WAIT_EVENT_UNDO_FILE_WRITE:
+			event_name = "UndoFileWrite";
+			break;
+		case WAIT_EVENT_UNDO_FILE_FLUSH:
+			event_name = "UndoFileFlush";
+			break;
+		case WAIT_EVENT_UNDO_FILE_SYNC:
+			event_name = "UndoFileSync";
+			break;
+
 		case WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ:
 			event_name = "WALSenderTimelineHistoryRead";
 			break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 273e2f3..359fc3c 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -176,6 +176,7 @@ static PrivateRefCountEntry *NewPrivateRefCountEntry(Buffer buffer);
 static PrivateRefCountEntry *GetPrivateRefCountEntry(Buffer buffer, bool do_move);
 static inline int32 GetPrivateRefCount(Buffer buffer);
 static void ForgetPrivateRefCountEntry(PrivateRefCountEntry *ref);
+static void InvalidateBuffer(BufferDesc *buf);
 
 /*
  * Ensure that the PrivateRefCountArray has sufficient space to store one more
@@ -618,10 +619,12 @@ ReadBuffer(Relation reln, BlockNumber blockNum)
  * valid, the page is zeroed instead of throwing an error. This is intended
  * for non-critical data, where the caller is prepared to repair errors.
  *
- * In RBM_ZERO_AND_LOCK mode, if the page isn't in buffer cache already, it's
+ * In RBM_ZERO mode, if the page isn't in buffer cache already, it's
  * filled with zeros instead of reading it from disk.  Useful when the caller
  * is going to fill the page from scratch, since this saves I/O and avoids
  * unnecessary failure if the page-on-disk has corrupt page headers.
+ *
+ * In RBM_ZERO_AND_LOCK mode, the page is zeroed and also locked.
  * The page is returned locked to ensure that the caller has a chance to
  * initialize the page before it's made visible to others.
  * Caution: do not use this mode to read a page that is beyond the relation's
@@ -672,24 +675,20 @@ ReadBufferExtended(Relation reln, ForkNumber forkNum, BlockNumber blockNum,
 /*
  * ReadBufferWithoutRelcache -- like ReadBufferExtended, but doesn't require
  *		a relcache entry for the relation.
- *
- * NB: At present, this function may only be used on permanent relations, which
- * is OK, because we only use it during XLOG replay.  If in the future we
- * want to use it on temporary or unlogged relations, we could pass additional
- * parameters.
  */
 Buffer
 ReadBufferWithoutRelcache(RelFileNode rnode, ForkNumber forkNum,
 						  BlockNumber blockNum, ReadBufferMode mode,
-						  BufferAccessStrategy strategy)
+						  BufferAccessStrategy strategy,
+						  char relpersistence)
 {
 	bool		hit;
 
-	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
-
-	Assert(InRecovery);
+	SMgrRelation smgr = smgropen(rnode,
+								 relpersistence == RELPERSISTENCE_TEMP
+								 ? MyBackendId : InvalidBackendId);
 
-	return ReadBuffer_common(smgr, RELPERSISTENCE_PERMANENT, forkNum, blockNum,
+	return ReadBuffer_common(smgr, relpersistence, forkNum, blockNum,
 							 mode, strategy, &hit);
 }
 
@@ -883,7 +882,9 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		 * Read in the page, unless the caller intends to overwrite it and
 		 * just wants us to allocate a buffer.
 		 */
-		if (mode == RBM_ZERO_AND_LOCK || mode == RBM_ZERO_AND_CLEANUP_LOCK)
+		if (mode == RBM_ZERO ||
+			mode == RBM_ZERO_AND_LOCK ||
+			mode == RBM_ZERO_AND_CLEANUP_LOCK)
 			MemSet((char *) bufBlock, 0, BLCKSZ);
 		else
 		{
@@ -1338,6 +1339,61 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 }
 
 /*
+ * ForgetBuffer -- drop a buffer from shared buffers
+ *
+ * If the buffer isn't present in shared buffers, nothing happens.  If it is
+ * present, it is discarded without making any attempt to write it back out to
+ * the operating system.  The caller must therefore somehow be sure that the
+ * data won't be needed for anything now or in the future.  It assumes that
+ * there is no concurrent access to the block, except that it might be being
+ * concurrently written.
+ */
+void
+ForgetBuffer(RelFileNode rnode, ForkNumber forkNum, BlockNumber blockNum)
+{
+	SMgrRelation smgr = smgropen(rnode, InvalidBackendId);
+	BufferTag	tag;			/* identity of target block */
+	uint32		hash;			/* hash value for tag */
+	LWLock	   *partitionLock;	/* buffer partition lock for it */
+	int			buf_id;
+	BufferDesc *bufHdr;
+	uint32		buf_state;
+
+	/* create a tag so we can lookup the buffer */
+	INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forkNum, blockNum);
+
+	/* determine its hash code and partition lock ID */
+	hash = BufTableHashCode(&tag);
+	partitionLock = BufMappingPartitionLock(hash);
+
+	/* see if the block is in the buffer pool */
+	LWLockAcquire(partitionLock, LW_SHARED);
+	buf_id = BufTableLookup(&tag, hash);
+	LWLockRelease(partitionLock);
+
+	/* didn't find it, so nothing to do */
+	if (buf_id < 0)
+		return;
+
+	/* take the buffer header lock */
+	bufHdr = GetBufferDescriptor(buf_id);
+	buf_state = LockBufHdr(bufHdr);
+
+	/*
+	 * The buffer might have been evicted after we released the partition lock and
+	 * before we acquired the buffer header lock.  If so, the buffer we've
+	 * locked might contain some other data which we shouldn't touch. If the
+	 * buffer hasn't been recycled, we proceed to invalidate it.
+	 */
+	if (RelFileNodeEquals(bufHdr->tag.rnode, rnode) &&
+		bufHdr->tag.blockNum == blockNum &&
+		bufHdr->tag.forkNum == forkNum)
+		InvalidateBuffer(bufHdr);		/* releases spinlock */
+	else
+		UnlockBufHdr(bufHdr, buf_state);
+}
+
+/*
  * InvalidateBuffer -- mark a shared buffer invalid and return it to the
  * freelist.
  *
@@ -1412,7 +1468,7 @@ retry:
 		LWLockRelease(oldPartitionLock);
 		/* safety check: should definitely not be our *own* pin */
 		if (GetPrivateRefCount(BufferDescriptorGetBuffer(buf)) > 0)
-			elog(ERROR, "buffer is pinned in InvalidateBuffer");
+			elog(PANIC, "buffer is pinned in InvalidateBuffer");
 		WaitIO(buf);
 		goto retry;
 	}
diff --git a/src/backend/storage/smgr/Makefile b/src/backend/storage/smgr/Makefile
index 2b95cb0..b657eb2 100644
--- a/src/backend/storage/smgr/Makefile
+++ b/src/backend/storage/smgr/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/smgr
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = md.o smgr.o smgrtype.o
+OBJS = md.o smgr.o smgrtype.o undofile.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/storage/smgr/md.c b/src/backend/storage/smgr/md.c
index e4501ff..9045b24 100644
--- a/src/backend/storage/smgr/md.c
+++ b/src/backend/storage/smgr/md.c
@@ -45,7 +45,7 @@
 #define UNLINKS_PER_ABSORB		10
 
 /*
- * Special values for the segno arg to RememberFsyncRequest.
+ * Special values for the segno arg to mdrequestsync.
  *
  * Note that CompactCheckpointerRequestQueue assumes that it's OK to remove an
  * fsync request from the queue if an identical, subsequent request is found.
@@ -1420,7 +1420,7 @@ register_dirty_segment(SMgrRelation reln, ForkNumber forknum, MdfdVec *seg)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
+		mdrequestsync(reln->smgr_rnode.node, forknum, seg->mdfd_segno);
 	}
 	else
 	{
@@ -1456,8 +1456,7 @@ register_unlink(RelFileNodeBackend rnode)
 	if (pendingOpsTable)
 	{
 		/* push it into local pending-ops table */
-		RememberFsyncRequest(rnode.node, MAIN_FORKNUM,
-							 UNLINK_RELATION_REQUEST);
+		mdrequestsync(rnode.node, MAIN_FORKNUM, UNLINK_RELATION_REQUEST);
 	}
 	else
 	{
@@ -1476,7 +1475,7 @@ register_unlink(RelFileNodeBackend rnode)
 }
 
 /*
- * RememberFsyncRequest() -- callback from checkpointer side of fsync request
+ * mdrequestsync() -- callback from checkpointer side of fsync request
  *
  * We stuff fsync requests into the local hash table for execution
  * during the checkpointer's next checkpoint.  UNLINK requests go into a
@@ -1497,7 +1496,7 @@ register_unlink(RelFileNodeBackend rnode)
  * heavyweight operation anyhow, so we'll live with it.)
  */
 void
-RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum, BlockNumber segno)
+mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
 {
 	Assert(pendingOpsTable);
 
@@ -1640,7 +1639,7 @@ ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, forknum, FORGET_RELATION_FSYNC);
+		mdrequestsync(rnode, forknum, FORGET_RELATION_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
@@ -1679,7 +1678,7 @@ ForgetDatabaseFsyncRequests(Oid dbid)
 	if (pendingOpsTable)
 	{
 		/* standalone backend or startup process: fsync state is local */
-		RememberFsyncRequest(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
+		mdrequestsync(rnode, InvalidForkNumber, FORGET_DATABASE_FSYNC);
 	}
 	else if (IsUnderPostmaster)
 	{
diff --git a/src/backend/storage/smgr/smgr.c b/src/backend/storage/smgr/smgr.c
index 0c0bba4..0802f13 100644
--- a/src/backend/storage/smgr/smgr.c
+++ b/src/backend/storage/smgr/smgr.c
@@ -58,6 +58,8 @@ typedef struct f_smgr
 	BlockNumber (*smgr_nblocks) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_truncate) (SMgrRelation reln, ForkNumber forknum,
 								  BlockNumber nblocks);
+	void		(*smgr_requestsync) (RelFileNode rnode, ForkNumber forknum,
+									 int segno);
 	void		(*smgr_immedsync) (SMgrRelation reln, ForkNumber forknum);
 	void		(*smgr_pre_ckpt) (void);	/* may be NULL */
 	void		(*smgr_sync) (void);	/* may be NULL */
@@ -81,15 +83,45 @@ static const f_smgr smgrsw[] = {
 		.smgr_writeback = mdwriteback,
 		.smgr_nblocks = mdnblocks,
 		.smgr_truncate = mdtruncate,
+		.smgr_requestsync = mdrequestsync,
 		.smgr_immedsync = mdimmedsync,
 		.smgr_pre_ckpt = mdpreckpt,
 		.smgr_sync = mdsync,
 		.smgr_post_ckpt = mdpostckpt
+	},
+	/* undo logs */
+	{
+		.smgr_init = undofile_init,
+		.smgr_shutdown = undofile_shutdown,
+		.smgr_close = undofile_close,
+		.smgr_create = undofile_create,
+		.smgr_exists = undofile_exists,
+		.smgr_unlink = undofile_unlink,
+		.smgr_extend = undofile_extend,
+		.smgr_prefetch = undofile_prefetch,
+		.smgr_read = undofile_read,
+		.smgr_write = undofile_write,
+		.smgr_writeback = undofile_writeback,
+		.smgr_nblocks = undofile_nblocks,
+		.smgr_truncate = undofile_truncate,
+		.smgr_requestsync = undofile_requestsync,
+		.smgr_immedsync = undofile_immedsync,
+		.smgr_pre_ckpt = undofile_preckpt,
+		.smgr_sync = undofile_sync,
+		.smgr_post_ckpt = undofile_postckpt
 	}
 };
 
 static const int NSmgr = lengthof(smgrsw);
 
+/*
+ * In ancient Postgres the catalog entry for each relation controlled the
+ * choice of storage manager implementation.  Now we have only md.c for
+ * regular relations, and undofile.c for undo log storage in the undolog
+ * pseudo-database.
+ */
+#define SmgrWhichForRelFileNode(rfn)			\
+	((rfn).dbNode == 9 ? 1 : 0)
 
 /*
  * Each backend has a hashtable that stores all extant SMgrRelation objects.
@@ -185,11 +217,18 @@ smgropen(RelFileNode rnode, BackendId backend)
 		reln->smgr_targblock = InvalidBlockNumber;
 		reln->smgr_fsm_nblocks = InvalidBlockNumber;
 		reln->smgr_vm_nblocks = InvalidBlockNumber;
-		reln->smgr_which = 0;	/* we only have md.c at present */
+
+		/* Which storage manager implementation? */
+		reln->smgr_which = SmgrWhichForRelFileNode(rnode);
 
 		/* mark it not open */
 		for (forknum = 0; forknum <= MAX_FORKNUM; forknum++)
+		{
 			reln->md_num_open_segs[forknum] = 0;
+			reln->md_seg_fds[forknum] = NULL;
+		}
+
+		reln->private_data = NULL;
 
 		/* it has no owner yet */
 		add_to_unowned_list(reln);
@@ -723,6 +762,14 @@ smgrtruncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
 }
 
 /*
+ *	smgrrequestsync() -- Enqueue a request for smgrsync() to flush data.
+ */
+void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	smgrsw[SmgrWhichForRelFileNode(rnode)].smgr_requestsync(rnode, forknum, segno);
+}
+
+/*
  *	smgrimmedsync() -- Force the specified relation to stable storage.
  *
  *		Synchronously force all previous writes to the specified relation
diff --git a/src/backend/storage/smgr/undofile.c b/src/backend/storage/smgr/undofile.c
new file mode 100644
index 0000000..afba64e
--- /dev/null
+++ b/src/backend/storage/smgr/undofile.c
@@ -0,0 +1,546 @@
+/*
+ * undofile.h
+ *
+ * PostgreSQL undo file manager.  This module provides SMGR-compatible
+ * interface to the files that back undo logs on the filesystem, so that undo
+ * log data can use the shared buffer pool.  Other aspects of undo log
+ * management are provided by undolog.c, so the SMGR interfaces not directly
+ * concerned with reading, writing and flushing data are unimplemented.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/storage/smgr/undofile.c
+ */
+
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/xlog.h"
+#include "miscadmin.h"
+#include "pgstat.h"
+#include "postmaster/bgwriter.h"
+#include "storage/fd.h"
+#include "storage/undofile.h"
+#include "utils/memutils.h"
+
+/* intervals for calling AbsorbFsyncRequests in undofile_sync */
+#define FSYNCS_PER_ABSORB		10
+
+/*
+ * Special values for the fork arg to undofile_requestsync.
+ */
+#define FORGET_UNDO_SEGMENT_FSYNC	(InvalidBlockNumber)
+
+/*
+ * While md.c expects random access and has a small number of huge
+ * segments, undofile.c manages a potentially very large number of smaller
+ * segments and has a less random access pattern.  Therefore, instead of
+ * keeping a potentially huge array of vfds we'll just keep the most
+ * recently accessed N.
+ *
+ * For now, N == 1, so we just need to hold onto one 'File' handle.
+ */
+typedef struct UndoFileState
+{
+	int		mru_segno;
+	File	mru_file;
+} UndoFileState;
+
+static MemoryContext UndoFileCxt;
+
+typedef uint16 CycleCtr;
+
+/*
+ * An entry recording the segments that need to be fsynced by undofile_sync().
+ * This is a bit simpler than md.c's version, though it could perhaps be
+ * merged into a common struct.  One difference is that we can have much
+ * larger segment numbers, so we'll adjust for that to avoid having a lot of
+ * leading zero bits.
+ */
+typedef struct
+{
+	RelFileNode rnode;
+	Bitmapset  *requests;
+	CycleCtr	cycle_ctr;
+} PendingOperationEntry;
+
+static HTAB *pendingOpsTable = NULL;
+static MemoryContext pendingOpsCxt;
+
+static CycleCtr undofile_sync_cycle_ctr = 0;
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok);
+static File undofile_get_segment_file(SMgrRelation reln, int segno);
+
+void
+undofile_init(void)
+{
+	UndoFileCxt = AllocSetContextCreate(TopMemoryContext,
+										"UndoFileSmgr",
+										ALLOCSET_DEFAULT_SIZES);
+
+	if (!IsUnderPostmaster || AmStartupProcess() || AmCheckpointerProcess())
+	{
+		HASHCTL		hash_ctl;
+
+		pendingOpsCxt = AllocSetContextCreate(UndoFileCxt,
+											  "Pending ops context",
+											  ALLOCSET_DEFAULT_SIZES);
+		MemoryContextAllowInCriticalSection(pendingOpsCxt, true);
+
+		MemSet(&hash_ctl, 0, sizeof(hash_ctl));
+		hash_ctl.keysize = sizeof(RelFileNode);
+		hash_ctl.entrysize = sizeof(PendingOperationEntry);
+		hash_ctl.hcxt = pendingOpsCxt;
+		pendingOpsTable = hash_create("Pending Ops Table",
+									  100L,
+									  &hash_ctl,
+									  HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+	}
+}
+
+void
+undofile_shutdown(void)
+{
+}
+
+void
+undofile_close(SMgrRelation reln, ForkNumber forknum)
+{
+}
+
+void
+undofile_create(SMgrRelation reln, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_create is not supported");
+}
+
+bool
+undofile_exists(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_exists is not supported");
+}
+
+void
+undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum, bool isRedo)
+{
+	elog(ERROR, "undofile_unlink is not supported");
+}
+
+void
+undofile_extend(SMgrRelation reln, ForkNumber forknum,
+				BlockNumber blocknum, char *buffer,
+				bool skipFsync)
+{
+	elog(ERROR, "undofile_extend is not supported");
+}
+
+void
+undofile_prefetch(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum)
+{
+	elog(ERROR, "undofile_prefetch is not supported");
+}
+
+void
+undofile_read(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
+			  char *buffer)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	nbytes = FileRead(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_READ);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not read block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		ereport(ERROR,
+				(errcode(ERRCODE_DATA_CORRUPTED),
+				 errmsg("could not read block %u in file \"%s\": read only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+}
+
+static void
+register_dirty_segment(SMgrRelation reln, ForkNumber forknum, int segno, File file)
+{
+	/* Temp relations should never be fsync'd */
+	Assert(!SmgrIsTemp(reln));
+
+	if (pendingOpsTable)
+	{
+		/* push it into local pending-ops table */
+		undofile_requestsync(reln->smgr_rnode.node, forknum, segno);
+	}
+	else
+	{
+		if (ForwardFsyncRequest(reln->smgr_rnode.node, forknum, segno))
+			return;				/* passed it off successfully */
+
+		ereport(DEBUG1,
+				(errmsg("could not forward fsync request because request queue is full")));
+
+		if (FileSync(file, WAIT_EVENT_DATA_FILE_SYNC) < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not fsync file \"%s\": %m",
+							FilePathName(file))));
+	}
+}
+
+void
+undofile_write(SMgrRelation reln, ForkNumber forknum,
+			   BlockNumber blocknum, char *buffer,
+			   bool skipFsync)
+{
+	File		file;
+	off_t		seekpos;
+	int			nbytes;
+
+	Assert(forknum == MAIN_FORKNUM);
+	file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+	seekpos = (off_t) BLCKSZ * (blocknum % ((BlockNumber) UNDOSEG_SIZE));
+	Assert(seekpos < (off_t) BLCKSZ * UNDOSEG_SIZE);
+	nbytes = FileWrite(file, buffer, BLCKSZ, seekpos, WAIT_EVENT_UNDO_FILE_WRITE);
+	if (nbytes != BLCKSZ)
+	{
+		if (nbytes < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write block %u in file \"%s\": %m",
+							blocknum, FilePathName(file))));
+		/*
+		 * short write: unexpected, because this should be overwriting an
+		 * entirely pre-allocated segment file
+		 */
+		ereport(ERROR,
+				(errcode(ERRCODE_DISK_FULL),
+				 errmsg("could not write block %u in file \"%s\": wrote only %d of %d bytes",
+						blocknum, FilePathName(file),
+						nbytes, BLCKSZ)));
+	}
+
+	if (!skipFsync && !SmgrIsTemp(reln))
+		register_dirty_segment(reln, forknum, blocknum / UNDOSEG_SIZE, file);
+}
+
+void
+undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+				   BlockNumber blocknum, BlockNumber nblocks)
+{
+	while (nblocks > 0)
+	{
+		File	file;
+		int		nflush;
+
+		file = undofile_get_segment_file(reln, blocknum / UNDOSEG_SIZE);
+
+		/* compute number of desired writes within the current segment */
+		nflush = Min(nblocks,
+					 1 + UNDOSEG_SIZE - (blocknum % UNDOSEG_SIZE));
+
+		FileWriteback(file,
+					  (blocknum % UNDOSEG_SIZE) * BLCKSZ,
+					  nflush * BLCKSZ, WAIT_EVENT_UNDO_FILE_FLUSH);
+
+		nblocks -= nflush;
+		blocknum += nflush;
+	}
+}
+
+BlockNumber
+undofile_nblocks(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_nblocks is not supported");
+	return 0;
+}
+
+void
+undofile_truncate(SMgrRelation reln, ForkNumber forknum, BlockNumber nblocks)
+{
+	elog(ERROR, "undofile_truncate is not supported");
+}
+
+void
+undofile_immedsync(SMgrRelation reln, ForkNumber forknum)
+{
+	elog(ERROR, "undofile_immedsync is not supported");
+}
+
+void
+undofile_preckpt(void)
+{
+}
+
+void
+undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno)
+{
+	MemoryContext oldcxt = MemoryContextSwitchTo(pendingOpsCxt);
+	PendingOperationEntry *entry;
+	bool		found;
+
+	Assert(pendingOpsTable);
+
+	if (forknum == FORGET_UNDO_SEGMENT_FSYNC)
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_FIND,
+													  NULL);
+		if (entry)
+			entry->requests = bms_del_member(entry->requests, segno);
+	}
+	else
+	{
+		entry = (PendingOperationEntry *) hash_search(pendingOpsTable,
+													  &rnode,
+													  HASH_ENTER,
+													  &found);
+		if (!found)
+		{
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+			entry->requests = bms_make_singleton(segno);
+		}
+		else
+			entry->requests = bms_add_member(entry->requests, segno);
+	}
+
+	MemoryContextSwitchTo(oldcxt);
+}
+
+void
+undofile_forgetsync(Oid logno, Oid tablespace, int segno)
+{
+	RelFileNode rnode;
+
+	rnode.dbNode = 9;
+	rnode.spcNode = tablespace;
+	rnode.relNode = logno;
+
+	if (pendingOpsTable)
+		undofile_requestsync(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno);
+	else if (IsUnderPostmaster)
+	{
+		while (!ForwardFsyncRequest(rnode, FORGET_UNDO_SEGMENT_FSYNC, segno))
+			pg_usleep(10000L);
+	}
+}
+
+void
+undofile_sync(void)
+{
+	static bool undofile_sync_in_progress = false;
+
+	HASH_SEQ_STATUS hstat;
+	PendingOperationEntry *entry;
+	int			absorb_counter;
+	int			segno;
+
+	if (!pendingOpsTable)
+		elog(ERROR, "cannot sync without a pendingOpsTable");
+
+	AbsorbFsyncRequests();
+
+	if (undofile_sync_in_progress)
+	{
+		/* prior try failed, so update any stale cycle_ctr values */
+		hash_seq_init(&hstat, pendingOpsTable);
+		while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+			entry->cycle_ctr = undofile_sync_cycle_ctr;
+	}
+
+	undofile_sync_cycle_ctr++;
+	undofile_sync_in_progress = true;
+
+	absorb_counter = FSYNCS_PER_ABSORB;
+	hash_seq_init(&hstat, pendingOpsTable);
+	while ((entry = (PendingOperationEntry *) hash_seq_search(&hstat)) != NULL)
+	{
+		Bitmapset	   *requests;
+
+		/* Skip entries that arrived after we arrived. */
+		if (entry->cycle_ctr == undofile_sync_cycle_ctr)
+			continue;
+
+		Assert((CycleCtr) (entry->cycle_ctr + 1) == undofile_sync_cycle_ctr);
+
+		if (!enableFsync)
+			continue;
+
+		requests = entry->requests;
+		entry->requests = NULL;
+
+		segno = -1;
+		while ((segno = bms_next_member(requests, segno)) >= 0)
+		{
+			File		file;
+
+			if (!enableFsync)
+				continue;
+
+			file = undofile_open_segment_file(entry->rnode.relNode,
+											  entry->rnode.spcNode,
+											  segno, true /* missing_ok */);
+
+			/*
+			 * The file may be gone due to concurrent discard.  We'll ignore
+			 * that, but only if we find a cancel request for this segment in
+			 * the queue.
+			 *
+			 * It's also possible that we succeed in opening a segment file
+			 * that is subsequently recycled (renamed to represent a new range
+			 * of undo log), in which case we'll fsync that later file
+			 * instead.  That is rare and harmless.
+			 */
+			if (file <= 0)
+			{
+				char		name[MAXPGPATH];
+
+				/*
+				 * Put the request back into the bitset in a way that can't
+				 * fail due to memory allocation.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				/*
+				 * Check if a forgetsync request has arrived to delete that
+				 * segment.
+				 */
+				AbsorbFsyncRequests();
+				if (bms_is_member(segno, entry->requests))
+				{
+					UndoLogSegmentPath(entry->rnode.relNode,
+									   segno,
+									   entry->rnode.spcNode,
+									   name);
+					ereport(ERROR,
+							(errcode_for_file_access(),
+							 errmsg("could not fsync file \"%s\": %m", name)));
+				}
+				/* It must have been removed, so we can safely skip it. */
+				continue;
+			}
+
+			elog(LOG, "fsync()ing %s", FilePathName(file));	/* TODO: remove me */
+			if (FileSync(file, WAIT_EVENT_UNDO_FILE_SYNC) < 0)
+			{
+				char		name[MAXPGPATH];
+
+				strcpy(name, FilePathName(file));
+				FileClose(file);
+
+				/*
+				 * Keep the failed requests, but merge with any new ones.  The
+				 * requirement to be able to do this without risk of failure
+				 * prevents us from using a smaller bitmap that doesn't bother
+				 * tracking leading zeros.  Perhaps another data structure
+				 * would be better.
+				 */
+				entry->requests = bms_join(entry->requests, requests);
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not fsync file \"%s\": %m", name)));
+			}
+			requests = bms_del_member(requests, segno);
+			FileClose(file);
+
+			if (--absorb_counter <= 0)
+			{
+				AbsorbFsyncRequests();
+				absorb_counter = FSYNCS_PER_ABSORB;
+			}
+		}
+
+		bms_free(requests);
+	}
+
+	undofile_sync_in_progress = true;
+}
+
+void undofile_postckpt(void)
+{
+}
+
+static File undofile_open_segment_file(Oid relNode, Oid spcNode, int segno,
+									   bool missing_ok)
+{
+	File		file;
+	char		path[MAXPGPATH];
+
+	UndoLogSegmentPath(relNode, segno, spcNode, path);
+	file = PathNameOpenFile(path, O_RDWR | PG_BINARY);
+
+	if (file <= 0 && (!missing_ok || errno != ENOENT))
+		elog(ERROR, "cannot open undo segment file '%s': %m", path);
+
+	return file;
+}
+
+/*
+ * Get a File for a particular segment of a SMgrRelation representing an undo
+ * log.
+ */
+static File undofile_get_segment_file(SMgrRelation reln, int segno)
+{
+	UndoFileState *state;
+
+
+	/*
+	 * Create private state space on demand.
+	 *
+	 * XXX There should probably be a smgr 'open' or 'init' interface that
+	 * would do this.  smgr.c currently initializes reln->md_XXX stuff
+	 * directly...
+	 */
+	state = (UndoFileState *) reln->private_data;
+	if (unlikely(state == NULL))
+	{
+		state = MemoryContextAllocZero(UndoFileCxt, sizeof(UndoFileState));
+		reln->private_data = state;
+	}
+
+	/* If we have a file open already, check if we need to close it. */
+	if (state->mru_file > 0 && state->mru_segno != segno)
+	{
+		/* These are not the blocks we're looking for. */
+		FileClose(state->mru_file);
+		state->mru_file = 0;
+	}
+
+	/* Check if we need to open a new file. */
+	if (state->mru_file <= 0)
+	{
+		state->mru_file =
+			undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+									   reln->smgr_rnode.node.spcNode,
+									   segno, InRecovery);
+		if (InRecovery && state->mru_file <= 0)
+		{
+			/*
+			 * If in recovery, we may be trying to access a file that will
+			 * later be unlinked.  Tolerate missing files, creating a new
+			 * zero-filled file as required.
+			 */
+			UndoLogNewSegment(reln->smgr_rnode.node.relNode,
+							  reln->smgr_rnode.node.spcNode,
+							  segno);
+			state->mru_file =
+				undofile_open_segment_file(reln->smgr_rnode.node.relNode,
+										   reln->smgr_rnode.node.spcNode,
+										   segno, false);
+			Assert(state->mru_file > 0);
+		}
+		state->mru_segno = segno;
+	}
+
+	return state->mru_file;
+}
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 313ca5f..e6b22f4 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -624,6 +624,11 @@ typedef struct PgStat_StatTabEntry
 	PgStat_Counter tuples_inserted;
 	PgStat_Counter tuples_updated;
 	PgStat_Counter tuples_deleted;
+
+	/*
+	 * Counter tuples_hot_updated stores number of hot updates for heap table
+	 * and the number of inplace updates for zheap table.
+	 */
 	PgStat_Counter tuples_hot_updated;
 
 	PgStat_Counter n_live_tuples;
@@ -743,6 +748,7 @@ typedef enum BackendState
 #define PG_WAIT_IPC					0x08000000U
 #define PG_WAIT_TIMEOUT				0x09000000U
 #define PG_WAIT_IO					0x0A000000U
+#define PG_WAIT_PAGE_TRANS_SLOT		0x0B000000U
 
 /* ----------
  * Wait Events - Activity
@@ -767,7 +773,7 @@ typedef enum
 	WAIT_EVENT_SYSLOGGER_MAIN,
 	WAIT_EVENT_WAL_RECEIVER_MAIN,
 	WAIT_EVENT_WAL_SENDER_MAIN,
-	WAIT_EVENT_WAL_WRITER_MAIN
+	WAIT_EVENT_WAL_WRITER_MAIN,
 } WaitEventActivity;
 
 /* ----------
@@ -913,6 +919,13 @@ typedef enum
 	WAIT_EVENT_TWOPHASE_FILE_READ,
 	WAIT_EVENT_TWOPHASE_FILE_SYNC,
 	WAIT_EVENT_TWOPHASE_FILE_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_READ,
+	WAIT_EVENT_UNDO_CHECKPOINT_WRITE,
+	WAIT_EVENT_UNDO_CHECKPOINT_SYNC,
+	WAIT_EVENT_UNDO_FILE_READ,
+	WAIT_EVENT_UNDO_FILE_WRITE,
+	WAIT_EVENT_UNDO_FILE_FLUSH,
+	WAIT_EVENT_UNDO_FILE_SYNC,
 	WAIT_EVENT_WALSENDER_TIMELINE_HISTORY_READ,
 	WAIT_EVENT_WAL_BOOTSTRAP_SYNC,
 	WAIT_EVENT_WAL_BOOTSTRAP_WRITE,
@@ -1317,6 +1330,7 @@ pgstat_report_wait_end(void)
 
 extern void pgstat_count_heap_insert(Relation rel, PgStat_Counter n);
 extern void pgstat_count_heap_update(Relation rel, bool hot);
+extern void pgstat_count_zheap_update(Relation rel);
 extern void pgstat_count_heap_delete(Relation rel);
 extern void pgstat_count_truncate(Relation rel);
 extern void pgstat_update_heap_dead_tuples(Relation rel, int delta);
diff --git a/src/include/storage/bufmgr.h b/src/include/storage/bufmgr.h
index f61794c..3f951e4 100644
--- a/src/include/storage/bufmgr.h
+++ b/src/include/storage/bufmgr.h
@@ -38,8 +38,9 @@ typedef enum BufferAccessStrategyType
 typedef enum
 {
 	RBM_NORMAL,					/* Normal read */
-	RBM_ZERO_AND_LOCK,			/* Don't read from disk, caller will
-								 * initialize. Also locks the page. */
+	RBM_ZERO,					/* Don't read from disk, caller will
+								 * initialize. */
+	RBM_ZERO_AND_LOCK,			/* Like RBM_ZERO, but also locks the page. */
 	RBM_ZERO_AND_CLEANUP_LOCK,	/* Like RBM_ZERO_AND_LOCK, but locks the page
 								 * in "cleanup" mode */
 	RBM_ZERO_ON_ERROR,			/* Read, but return an all-zeros page on error */
@@ -171,7 +172,10 @@ extern Buffer ReadBufferExtended(Relation reln, ForkNumber forkNum,
 				   BufferAccessStrategy strategy);
 extern Buffer ReadBufferWithoutRelcache(RelFileNode rnode,
 						  ForkNumber forkNum, BlockNumber blockNum,
-						  ReadBufferMode mode, BufferAccessStrategy strategy);
+						  ReadBufferMode mode, BufferAccessStrategy strategy,
+						  char relpersistence);
+extern void ForgetBuffer(RelFileNode rnode, ForkNumber forkNum,
+			 BlockNumber blockNum);
 extern void ReleaseBuffer(Buffer buffer);
 extern void UnlockReleaseBuffer(Buffer buffer);
 extern void MarkBufferDirty(Buffer buffer);
@@ -228,6 +232,10 @@ extern void AtProcExit_LocalBuffers(void);
 
 extern void TestForOldSnapshot_impl(Snapshot snapshot, Relation relation);
 
+/* in localbuf.c */
+extern void ForgetLocalBuffer(RelFileNode rnode, ForkNumber forkNum,
+				  BlockNumber blockNum);
+
 /* in freelist.c */
 extern BufferAccessStrategy GetAccessStrategy(BufferAccessStrategyType btype);
 extern void FreeAccessStrategy(BufferAccessStrategy strategy);
diff --git a/src/include/storage/smgr.h b/src/include/storage/smgr.h
index 820d08e..0727674 100644
--- a/src/include/storage/smgr.h
+++ b/src/include/storage/smgr.h
@@ -71,6 +71,9 @@ typedef struct SMgrRelationData
 	int			md_num_open_segs[MAX_FORKNUM + 1];
 	struct _MdfdVec *md_seg_fds[MAX_FORKNUM + 1];
 
+	/* For use by implementations. */
+	void	   *private_data;
+
 	/* if unowned, list link in list of all unowned SMgrRelations */
 	struct SMgrRelationData *next_unowned_reln;
 } SMgrRelationData;
@@ -105,6 +108,7 @@ extern void smgrwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber smgrnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void smgrtruncate(SMgrRelation reln, ForkNumber forknum,
 			 BlockNumber nblocks);
+extern void smgrrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void smgrimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void smgrpreckpt(void);
 extern void smgrsync(void);
@@ -133,14 +137,41 @@ extern void mdwriteback(SMgrRelation reln, ForkNumber forknum,
 extern BlockNumber mdnblocks(SMgrRelation reln, ForkNumber forknum);
 extern void mdtruncate(SMgrRelation reln, ForkNumber forknum,
 		   BlockNumber nblocks);
+extern void mdrequestsync(RelFileNode rnode, ForkNumber forknum, int segno);
 extern void mdimmedsync(SMgrRelation reln, ForkNumber forknum);
 extern void mdpreckpt(void);
 extern void mdsync(void);
 extern void mdpostckpt(void);
 
+/* in undofile.c */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+		 BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+		BlockNumber blocknum, char *buffer, bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+			BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+		   BlockNumber nblocks);
+extern void undofile_requestsync(RelFileNode rnode, ForkNumber forknum, int segno);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_preckpt(void);
+extern void undofile_sync(void);
+extern void undofile_postckpt(void);
+
 extern void SetForwardFsyncRequests(void);
-extern void RememberFsyncRequest(RelFileNode rnode, ForkNumber forknum,
-					 BlockNumber segno);
 extern void ForgetRelationFsyncRequests(RelFileNode rnode, ForkNumber forknum);
 extern void ForgetDatabaseFsyncRequests(Oid dbid);
 extern void DropRelationFiles(RelFileNode *delrels, int ndelrels, bool isRedo);
diff --git a/src/include/storage/undofile.h b/src/include/storage/undofile.h
new file mode 100644
index 0000000..7544be3
--- /dev/null
+++ b/src/include/storage/undofile.h
@@ -0,0 +1,50 @@
+/*
+ * undofile.h
+ *
+ * PostgreSQL undo file manager.  This module manages the files that back undo
+ * logs on the filesystem.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/storage/undofile.h
+ */
+
+#ifndef UNDOFILE_H
+#define UNDOFILE_H
+
+#include "storage/smgr.h"
+
+/* Prototypes of functions exposed to SMgr. */
+extern void undofile_init(void);
+extern void undofile_shutdown(void);
+extern void undofile_close(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_create(SMgrRelation reln, ForkNumber forknum,
+							bool isRedo);
+extern bool undofile_exists(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_unlink(RelFileNodeBackend rnode, ForkNumber forknum,
+							bool isRedo);
+extern void undofile_extend(SMgrRelation reln, ForkNumber forknum,
+							BlockNumber blocknum, char *buffer,
+							bool skipFsync);
+extern void undofile_prefetch(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber blocknum);
+extern void undofile_read(SMgrRelation reln, ForkNumber forknum,
+						  BlockNumber blocknum, char *buffer);
+extern void undofile_write(SMgrRelation reln, ForkNumber forknum,
+						   BlockNumber blocknum, char *buffer,
+						   bool skipFsync);
+extern void undofile_writeback(SMgrRelation reln, ForkNumber forknum,
+							   BlockNumber blocknum, BlockNumber nblocks);
+extern BlockNumber undofile_nblocks(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_truncate(SMgrRelation reln, ForkNumber forknum,
+							  BlockNumber nblocks);
+extern void undofile_immedsync(SMgrRelation reln, ForkNumber forknum);
+extern void undofile_pre_ckpt(void);
+extern void undofile_sync(void);
+extern void undofile_post_ckpt(void);
+
+/* Functions used by undolog.c. */
+extern void undofile_forgetsync(Oid logno, Oid tablespace, int segno);
+
+#endif
-- 
1.8.3.1

0001-Add-undo-log-manager.patch (application/octet-stream)
From 9c59cf0afa20753f69c2c1477df325507565e0a2 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 5 Jan 2019 10:41:09 +0530
Subject: [PATCH 1/3] Add undo log manager.

Add a new subsystem to manage undo logs.  Undo logs allow data to be appended
efficiently, like logs.  They also allow data to be discarded efficiently from
the other end, like a queue.  Thirdly, they allow efficient buffered random
access, like a relation.

Undo logs physically consist of a set of 1MB segment files under
$PGDATA/base/undo (or per-tablespace equivalent) that are created, deleted or
renamed as required, similarly to the way that WAL segments are managed.
Meta-data about the set of undo logs is stored in shared memory, and written
to per-checkpoint files under $PGDATA/pg_undo.

This commit provides an API for allocating and discarding undo log storage
space and managing the files in a crash-safe way.  A later commit will provide
support for accessing the data stored inside them.

XXX Status: WIP.  Some details around WAL are being reconsidered, as noted in
comments.

Author: Thomas Munro, with contributions from Dilip Kumar and input from
        Amit Kapila and Robert Haas
Tested-By: Neha Sharma
Discussion: https://postgr.es/m/CAEepm%3D2EqROYJ_xYz4v5kfr4b0qw_Lq_6Pe8RTEC8rx3upWsSQ%40mail.gmail.com
---
 src/backend/access/Makefile               |    2 +-
 src/backend/access/rmgrdesc/Makefile      |    2 +-
 src/backend/access/rmgrdesc/undologdesc.c |   97 ++
 src/backend/access/transam/rmgr.c         |    1 +
 src/backend/access/transam/xlog.c         |   17 +
 src/backend/access/undo/Makefile          |   17 +
 src/backend/access/undo/undolog.c         | 2682 +++++++++++++++++++++++++++++
 src/backend/catalog/system_views.sql      |    4 +
 src/backend/commands/tablespace.c         |   23 +
 src/backend/replication/logical/decode.c  |    1 +
 src/backend/storage/ipc/ipci.c            |    3 +
 src/backend/storage/lmgr/lwlock.c         |    2 +
 src/backend/storage/lmgr/lwlocknames.txt  |    1 +
 src/backend/utils/init/postinit.c         |    1 +
 src/backend/utils/misc/guc.c              |   12 +
 src/bin/initdb/initdb.c                   |    2 +
 src/bin/pg_waldump/rmgrdesc.c             |    1 +
 src/include/access/rmgrlist.h             |    1 +
 src/include/access/undolog.h              |  398 +++++
 src/include/access/undolog_xlog.h         |   73 +
 src/include/catalog/pg_proc.dat           |    7 +
 src/include/storage/lwlock.h              |    2 +
 src/include/utils/guc.h                   |    2 +
 src/test/regress/expected/rules.out       |   11 +
 24 files changed, 3360 insertions(+), 2 deletions(-)
 create mode 100644 src/backend/access/rmgrdesc/undologdesc.c
 create mode 100644 src/backend/access/undo/Makefile
 create mode 100644 src/backend/access/undo/undolog.c
 create mode 100644 src/include/access/undolog.h
 create mode 100644 src/include/access/undolog_xlog.h

diff --git a/src/backend/access/Makefile b/src/backend/access/Makefile
index bd93a6a..7f7380c 100644
--- a/src/backend/access/Makefile
+++ b/src/backend/access/Makefile
@@ -9,6 +9,6 @@ top_builddir = ../../..
 include $(top_builddir)/src/Makefile.global
 
 SUBDIRS	    = brin common gin gist hash heap index nbtree rmgrdesc spgist \
-			  tablesample transam
+			  tablesample transam undo
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index 5514db1..91ad1ef 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -11,6 +11,6 @@ include $(top_builddir)/src/Makefile.global
 OBJS = brindesc.o clogdesc.o committsdesc.o dbasedesc.o genericdesc.o \
 	   gindesc.o gistdesc.o hashdesc.o heapdesc.o logicalmsgdesc.o \
 	   mxactdesc.o nbtdesc.o relmapdesc.o replorigindesc.o seqdesc.o \
-	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o xactdesc.o xlogdesc.o
+	   smgrdesc.o spgdesc.o standbydesc.o tblspcdesc.o undologdesc.o xactdesc.o xlogdesc.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/rmgrdesc/undologdesc.c b/src/backend/access/rmgrdesc/undologdesc.c
new file mode 100644
index 0000000..1053dc7
--- /dev/null
+++ b/src/backend/access/rmgrdesc/undologdesc.c
@@ -0,0 +1,97 @@
+/*-------------------------------------------------------------------------
+ *
+ * undologdesc.c
+ *	  rmgr descriptor routines for access/undo/undolog.c
+ *
+ * Portions Copyright (c) 1996-2017, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/undologdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+
+void
+undolog_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_UNDOLOG_CREATE)
+	{
+		xl_undolog_create *xlrec = (xl_undolog_create *) rec;
+
+		appendStringInfo(buf, "logno %u", xlrec->logno);
+	}
+	else if (info == XLOG_UNDOLOG_EXTEND)
+	{
+		xl_undolog_extend *xlrec = (xl_undolog_extend *) rec;
+
+		appendStringInfo(buf, "logno %u end " UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_ATTACH)
+	{
+		xl_undolog_attach *xlrec = (xl_undolog_attach *) rec;
+
+		appendStringInfo(buf, "logno %u xid %u", xlrec->logno, xlrec->xid);
+	}
+	else if (info == XLOG_UNDOLOG_DISCARD)
+	{
+		xl_undolog_discard *xlrec = (xl_undolog_discard *) rec;
+
+		appendStringInfo(buf, "logno %u discard " UndoLogOffsetFormat " end "
+						 UndoLogOffsetFormat,
+						 xlrec->logno, xlrec->discard, xlrec->end);
+	}
+	else if (info == XLOG_UNDOLOG_REWIND)
+	{
+		xl_undolog_rewind *xlrec = (xl_undolog_rewind *) rec;
+
+		appendStringInfo(buf, "logno %u insert " UndoLogOffsetFormat " prevlen %d",
+						 xlrec->logno, xlrec->insert, xlrec->prevlen);
+	}
+	else if (info == XLOG_UNDOLOG_SWITCH)
+	{
+		UndoRecPtr prevlogurp = *(UndoRecPtr *) rec;
+
+		appendStringInfo(buf, "previous log urp " UndoRecPtrFormat, prevlogurp);
+	}	
+
+}
+
+const char *
+undolog_identify(uint8 info)
+{
+	const char *id = NULL;
+
+	switch (info & ~XLR_INFO_MASK)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			id = "CREATE";
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			id = "EXTEND";
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			id = "ATTACH";
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			id = "DISCARD";
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			id = "REWIND";
+			break;
+		case XLOG_UNDOLOG_SWITCH:
+			id = "SWITCH";
+			break;			
+	}
+
+	return id;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 9368b56..8b05374 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -18,6 +18,7 @@
 #include "access/multixact.h"
 #include "access/nbtxlog.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9823b75..c4c5ab4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -31,6 +31,7 @@
 #include "access/transam.h"
 #include "access/tuptoaster.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "access/xloginsert.h"
@@ -6699,6 +6700,9 @@ StartupXLOG(void)
 	 */
 	restoreTwoPhaseData();
 
+	/* Recover undo log meta data corresponding to this checkpoint. */
+	StartupUndoLogs(ControlFile->checkPointCopy.redo);
+
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
 	RedoRecPtr = XLogCtl->RedoRecPtr = XLogCtl->Insert.RedoRecPtr = checkPoint.redo;
@@ -7321,7 +7325,13 @@ StartupXLOG(void)
 	 * end-of-recovery steps fail.
 	 */
 	if (InRecovery)
+	{
 		ResetUnloggedRelations(UNLOGGED_RELATION_INIT);
+		ResetUndoLogs(UNDO_UNLOGGED);
+	}
+
+	/* Always reset temporary undo logs. */
+	ResetUndoLogs(UNDO_TEMP);
 
 	/*
 	 * We don't need the latch anymore. It's not strictly necessary to disown
@@ -9026,6 +9036,7 @@ CheckPointGuts(XLogRecPtr checkPointRedo, int flags)
 	CheckPointSnapBuild();
 	CheckPointLogicalRewriteHeap();
 	CheckPointBuffers(flags);	/* performs all required fsyncs */
+	CheckPointUndoLogs(checkPointRedo, ControlFile->checkPointCopy.redo);
 	CheckPointReplicationOrigin();
 	/* We deliberately delay 2PC checkpointing as long as possible */
 	CheckPointTwoPhase(checkPointRedo);
@@ -9732,6 +9743,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/*
 		 * We should've already switched to the new TLI before replaying this
 		 * record.
@@ -9791,6 +9805,9 @@ xlog_redo(XLogReaderState *record)
 		XLogCtl->ckptXid = checkPoint.nextXid;
 		SpinLockRelease(&XLogCtl->info_lck);
 
+		/* Write an undo log metadata snapshot. */
+		CheckPointUndoLogs(checkPoint.redo, ControlFile->checkPointCopy.redo);
+
 		/* TLI should not change in an on-line checkpoint */
 		if (checkPoint.ThisTimeLineID != ThisTimeLineID)
 			ereport(PANIC,
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
new file mode 100644
index 0000000..219c696
--- /dev/null
+++ b/src/backend/access/undo/Makefile
@@ -0,0 +1,17 @@
+#-------------------------------------------------------------------------
+#
+# Makefile--
+#    Makefile for access/undo
+#
+# IDENTIFICATION
+#    src/backend/access/undo/Makefile
+#
+#-------------------------------------------------------------------------
+
+subdir = src/backend/access/undo
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+
+OBJS = undolog.o
+
+include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undolog.c b/src/backend/access/undo/undolog.c
new file mode 100644
index 0000000..42a9590
--- /dev/null
+++ b/src/backend/access/undo/undolog.c
@@ -0,0 +1,2682 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.c
+ *	  management of undo logs
+ *
+ * PostgreSQL undo log manager.  This module is responsible for managing the
+ * lifecycle of undo logs and their segment files, associating undo logs with
+ * backends, and allocating space within undo logs.
+ *
+ * For the code that reads and writes blocks of data, see undofile.c.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undolog.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/transam.h"
+#include "access/undolog.h"
+#include "access/undolog_xlog.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/xlogreader.h"
+#include "catalog/catalog.h"
+#include "catalog/pg_tablespace.h"
+#include "commands/tablespace.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "nodes/execnodes.h"
+#include "pgstat.h"
+#include "storage/buf.h"
+#include "storage/bufmgr.h"
+#include "storage/fd.h"
+#include "storage/ipc.h"
+#include "storage/lwlock.h"
+#include "storage/procarray.h"
+#include "storage/shmem.h"
+#include "storage/standby.h"
+#include "storage/undofile.h"
+#include "utils/builtins.h"
+#include "utils/guc.h"
+#include "utils/memutils.h"
+#include "utils/varlena.h"
+
+#include <sys/stat.h>
+#include <unistd.h>
+
+/*
+ * During recovery we maintain a mapping of transaction ID to undo logs
+ * numbers.  We do this with a two-level array, so that we use memory only for
+ * chunks of the array that overlap with the range of active xids.
+ */
+#define UndoLogXidLowBits 16
+
+/*
+ * Number of high bits.
+ */
+#define UndoLogXidHighBits \
+	(sizeof(TransactionId) * CHAR_BIT - UndoLogXidLowBits)
+
+/* Extract the upper bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidHigh(xid) ((xid) >> UndoLogXidLowBits)
+
+/* Extract the lower bits of an xid, for undo log mapping purposes. */
+#define UndoLogGetXidLow(xid) ((xid) & ((1 << UndoLogXidLowBits) - 1))
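+
+/*
+ * For example (illustrative only): with UndoLogXidLowBits = 16, xid
+ * 0x12345678 falls in map chunk 0x1234 (the high bits) at offset 0x5678 (the
+ * low bits), so each chunk of the map covers 2^16 xids.
+ */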
+
+/*
+ * Main control structure for undo log management in shared memory.
+ * UndoLogControl objects are arranged in a fixed-size array, at a position
+ * determined by the undo log number.
+ */
+typedef struct UndoLogSharedData
+{
+	UndoLogNumber free_lists[UndoPersistenceLevels];
+	UndoLogNumber low_logno; /* the lowest logno */
+	UndoLogNumber next_logno; /* one past the highest logno */
+	UndoLogNumber array_size; /* how many UndoLogControl objects do we have? */
+	UndoLogControl logs[FLEXIBLE_ARRAY_MEMBER];
+} UndoLogSharedData;
+
+/*
+ * Per-backend state for the undo log module, including backend-local
+ * pointers into the shared-memory state above.
+ */
+typedef struct UndoLogSession
+{
+	UndoLogSharedData *shared;
+
+	/*
+	 * The control object for the undo logs that this session is currently
+	 * attached to at each persistence level.  This is where it will write new
+	 * undo data.
+	 */
+	UndoLogControl *logs[UndoPersistenceLevels];
+
+	/*
+	 * If the undo_tablespaces GUC changes we'll remember to examine it and
+	 * attach to a new undo log using this flag.
+	 */
+	bool			need_to_choose_tablespace;
+
+	/*
+	 * During recovery, the startup process maintains a mapping of xid to undo
+	 * log number, instead of using 'logs' above.  This is not used in regular
+	 * backends and can be in backend-private memory so long as recovery is
+	 * single-process.  This map references UNDO_PERMANENT logs only, since
+	 * temporary and unlogged relations don't have WAL to replay.
+	 */
+	UndoLogNumber **xid_map;
+
+	/*
+	 * The slot for the oldest xids still running.  We advance this during
+	 * checkpoints to free up chunks of the map.
+	 */
+	uint16			xid_map_oldest_chunk;
+
+	/* Current dbid.  Used during recovery. */
+	Oid				dbid;
+
+	/*
+	 * The transaction's start header undo record pointer in the previous
+	 * undo log, used when a transaction spills across multiple undo logs.
+	 * It is used to identify the log switch during recovery and to update
+	 * the transaction header in the previous log.
+	 */
+	UndoRecPtr	prevlogurp;	
+} UndoLogSession;
+
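+/* This backend's undo log state (see UndoLogSession above). */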
+UndoLogSession MyUndoLogState;
+
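+/*
+ * Per-backend cache mapping undo log numbers to their UndoLogControl slots
+ * (see get_undo_log()), so that we can usually avoid searching shared memory.
+ */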
+undologtable_hash *undologtable_cache;
+
+/* GUC variables */
+char	   *undo_tablespaces = NULL;
+
+static UndoLogControl *get_undo_log(UndoLogNumber logno, bool locked);
+static UndoLogControl *allocate_undo_log(void);
+static void free_undo_log(UndoLogControl *log);
+static void attach_undo_log(UndoPersistence level, Oid tablespace);
+static void detach_current_undo_log(UndoPersistence level, bool full);
+static void extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end);
+static void undo_log_before_exit(int code, Datum value);
+static void forget_undo_buffers(int logno, UndoLogOffset old_discard,
+								UndoLogOffset new_discard,
+								bool drop_tail);
+static bool choose_undo_tablespace(bool force_detach, Oid *oid);
+static void undolog_xid_map_gc(void);
+
+PG_FUNCTION_INFO_V1(pg_stat_get_undo_logs);
+
+/*
+ * How many undo logs can be active at a time?  This creates a theoretical
+ * maximum transaction size, but if we set it to a multiple of the maximum
+ * number of backends it will be a very high limit.  Alternative designs involving
+ * demand paging or dynamic shared memory could remove this limit but
+ * introduce other problems.
+ */
+static inline size_t
+UndoLogNumSlots(void)
+{
+	return MaxBackends * 4;
+}
+
+/*
+ * Return the amount of traditional shmem required for undo log management.
+ * Extra shared memory will be managed using DSM segments.
+ */
+Size
+UndoLogShmemSize(void)
+{
+	return sizeof(UndoLogSharedData) +
+		UndoLogNumSlots() * sizeof(UndoLogControl);
+}
+
+/*
+ * Initialize the undo log subsystem.  Called in each backend.
+ */
+void
+UndoLogShmemInit(void)
+{
+	bool found;
+
+	MyUndoLogState.shared = (UndoLogSharedData *)
+		ShmemInitStruct("UndoLogShared", UndoLogShmemSize(), &found);
+
+	/* The postmaster initialized the shared memory state. */
+	if (!IsUnderPostmaster)
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		Assert(!found);
+
+		/*
+		 * We start with no active undo logs.  StartUpUndoLogs() will recreate
+		 * the undo logs that were known at the last checkpoint.
+		 */
+		memset(shared, 0, sizeof(*shared));
+		shared->array_size = UndoLogNumSlots();
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+			shared->free_lists[i] = InvalidUndoLogNumber;
+		for (i = 0; i < shared->array_size; ++i)
+		{
+			memset(&shared->logs[i], 0, sizeof(shared->logs[i]));
+			shared->logs[i].logno = InvalidUndoLogNumber;
+			LWLockInitialize(&shared->logs[i].mutex,
+							 LWTRANCHE_UNDOLOG);
+			LWLockInitialize(&shared->logs[i].discard_lock,
+							 LWTRANCHE_UNDODISCARD);
+		}
+	}
+	else
+		Assert(found);
+
+	/* All backends prepare their per-backend lookup table. */
+	undologtable_cache = undologtable_create(TopMemoryContext,
+											 UndoLogNumSlots(),
+											 NULL);
+}
+
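+/*
+ * Per-backend initialization: register a before-shmem-exit callback so that
+ * we detach from any attached undo logs before exiting.
+ */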
+void
+UndoLogInit(void)
+{
+	before_shmem_exit(undo_log_before_exit, 0);
+}
+
+/*
+ * Figure out which directory holds an undo log based on tablespace.
+ */
+static void
+UndoLogDirectory(Oid tablespace, char *dir)
+{
+	if (tablespace == DEFAULTTABLESPACE_OID ||
+		tablespace == InvalidOid)
+		snprintf(dir, MAXPGPATH, "base/undo");
+	else
+		snprintf(dir, MAXPGPATH, "pg_tblspc/%u/%s/undo",
+				 tablespace, TABLESPACE_VERSION_DIRECTORY);
+}
+
+/*
+ * Compute the pathname to use for an undo log segment file.
+ */
+void
+UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace, char *path)
+{
+	char		dir[MAXPGPATH];
+
+	/* Figure out which directory holds the segment, based on tablespace. */
+	UndoLogDirectory(tablespace, dir);
+
+	/*
+	 * Build the path from log number and offset.  The pathname is the
+	 * UndoRecPtr of the first byte in the segment in hexadecimal, with a
+	 * period inserted between the components.
+	 */
+	snprintf(path, MAXPGPATH, "%s/%06X.%010zX", dir, logno,
+			 segno * UndoLogSegmentSize);
+}
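+
+/*
+ * For example (illustrative only, assuming the 1MB UndoLogSegmentSize): log
+ * number 3, segment number 2 in the default tablespace yields the path
+ * "base/undo/000003.0000200000".
+ */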
+
+/*
+ * Iterate through the set of currently active logs.  Pass in NULL to get the
+ * first undo log.  NULL indicates the end of the set of logs.  The caller
+ * must lock the returned log before accessing its members, and must skip if
+ * logno is not valid.
+ */
+UndoLogControl *
+UndoLogNext(UndoLogControl *log)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+	for (;;)
+	{
+		/* Advance to the next log. */
+		if (log == NULL)
+		{
+			/* Start at the beginning. */
+			log = &shared->logs[0];
+		}
+		else if (++log == &shared->logs[shared->array_size])
+		{
+			/* Past the end. */
+			log = NULL;
+			break;
+		}
+		/* Have we found a slot with a valid log? */
+		if (log->logno != InvalidUndoLogNumber)
+			break;
+	}
+	LWLockRelease(UndoLogLock);
+
+	/* XXX: erm, which lock should the caller hold!? */
+	return log;
+}
+
+/*
+ * Check if an undo log position has been discarded.  'point' must be an undo
+ * log pointer that was allocated at some point in the past, otherwise the
+ * result is undefined.
+ */
+bool
+UndoLogIsDiscarded(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log;
+	bool	result;
+
+	log = get_undo_log(logno, false);
+
+	/*
+	 * If we couldn't find the undo log number, then it must be entirely
+	 * discarded.
+	 */
+	if (log == NULL)
+		return true;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (unlikely(logno != log->logno))
+	{
+		/*
+		 * The undo log has been entirely discarded since we looked it up, and
+		 * the UndoLogControl slot is now unused or being used for some other
+		 * undo log.  That means that any pointer within it must be discarded.
+		 */
+		result = true;
+	}
+	else
+	{
+		/* Check if this point is before the discard pointer. */
+		result = UndoRecPtrGetOffset(point) < log->meta.discard;
+	}
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Store the latest transaction's start undo record pointer in the undo
+ * meta-data.  It will be fetched by the backend when it reuses the undo log
+ * and prepares its first undo record.
+ */
+void
+UndoLogSetLastXactStartPoint(UndoRecPtr point)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(point);
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO: review */
+	log->meta.last_xact_start = UndoRecPtrGetOffset(point);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Fetch the previous transaction's start undo record point.
+ */
+UndoRecPtr
+UndoLogGetLastXactStartPoint(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint64 last_xact_start = 0;
+
+	if (unlikely(log == NULL))
+		return InvalidUndoRecPtr;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO: review */
+	last_xact_start = log->meta.last_xact_start;
+	LWLockRelease(&log->mutex);
+
+	if (last_xact_start == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, last_xact_start);
+}
+
+/*
+ * Store the last undo record's length in the undo meta-data so that it
+ * persists across restarts.
+ */
+void
+UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	/* TODO review */
+	log->meta.prevlen = prevlen;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get the last undo record's length.
+ */
+uint16
+UndoLogGetPrevLen(UndoLogNumber logno)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	uint16	prevlen;
+
+	Assert(log != NULL);
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	/* TODO review */
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	return prevlen;
+}
+
+/*
+ * Is this record the first record of its transaction?
+ */
+bool
+IsTransactionFirstRec(TransactionId xid)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	Assert(InRecovery);
+
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/* TODO review */
+	return log->meta.is_first_rec;
+}
+
+/*
+ * Detach from the undo log we are currently attached to, returning it to the
+ * appropriate free list if it still has space.
+ */
+static void
+detach_current_undo_log(UndoPersistence persistence, bool full)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+
+	MyUndoLogState.logs[persistence] = NULL;
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = InvalidPid;
+	log->xid = InvalidTransactionId;
+	if (full)
+		log->meta.status = UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	/* Push back onto the appropriate free list, unless it's full. */
+	if (!full)
+	{
+		LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+		log->next_free = shared->free_lists[persistence];
+		shared->free_lists[persistence] = log->logno;
+		LWLockRelease(UndoLogLock);
+	}
+}
+
+/*
+ * Exit handler, detaching from all undo logs.
+ */
+static void
+undo_log_before_exit(int code, Datum arg)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		if (MyUndoLogState.logs[i] != NULL)
+			detach_current_undo_log(i, false);
+	}
+}
+
+/*
+ * Create a new empty segment file on disk for the segment starting at offset 'end'.
+ */
+static void
+allocate_empty_undo_segment(UndoLogNumber logno, Oid tablespace,
+							UndoLogOffset end)
+{
+	struct stat	stat_buffer;
+	off_t	size;
+	char	path[MAXPGPATH];
+	void   *zeroes;
+	size_t	nzeroes = 8192;
+	int		fd;
+
+	UndoLogSegmentPath(logno, end / UndoLogSegmentSize, tablespace, path);
+
+	/*
+	 * Create and fully allocate a new file.  If we crashed and recovered
+	 * then the file might already exist, so use flags that tolerate that.
+	 * It's also possible that it exists but is too short, in which case
+	 * we'll write the rest.  We don't really care what's in the file, we
+	 * just want to make sure that the filesystem has allocated physical
+	 * blocks for it, so that non-COW filesystems will report ENOSPC now
+	 * rather than later when the space is needed and we'll avoid creating
+	 * files with holes.
+	 */
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0 && tablespace != 0)
+	{
+		char undo_path[MAXPGPATH];
+
+		/* Try creating the undo directory for this tablespace. */
+		UndoLogDirectory(tablespace, undo_path);
+		if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+		{
+			char	   *parentdir;
+
+			if (errno != ENOENT || !InRecovery)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+
+			/*
+			 * In recovery, it's possible that the tablespace directory
+			 * doesn't exist because a later WAL record removed the whole
+			 * tablespace.  In that case we create a regular directory to
+			 * stand in for it.  This is similar to the logic in
+			 * TablespaceCreateDbspace().
+			 */
+
+			/* create two parents up if not exist */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			/* create one parent up if not exist */
+			parentdir = pstrdup(undo_path);
+			get_parent_directory(parentdir);
+			/* Can't create parent and it doesn't already exist? */
+			if (mkdir(parentdir, S_IRWXU) < 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								parentdir)));
+			pfree(parentdir);
+
+			if (mkdir(undo_path, S_IRWXU) != 0 && errno != EEXIST)
+				ereport(ERROR,
+						(errcode_for_file_access(),
+						 errmsg("could not create directory \"%s\": %m",
+								undo_path)));
+		}
+
+		fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	}
+	if (fd < 0)
+		elog(ERROR, "could not create new file \"%s\": %m", path);
+	if (fstat(fd, &stat_buffer) < 0)
+		elog(ERROR, "could not stat \"%s\": %m", path);
+	size = stat_buffer.st_size;
+
+	/* A buffer full of zeroes we'll use to fill up new segment files. */
+	zeroes = palloc0(nzeroes);
+
+	while (size < UndoLogSegmentSize)
+	{
+		ssize_t written;
+
+		written = write(fd, zeroes, Min(nzeroes, UndoLogSegmentSize - size));
+		if (written < 0)
+			elog(ERROR, "cannot initialize undo log segment file \"%s\": %m",
+				 path);
+		size += written;
+	}
+
+	/* Flush the contents of the file to disk. */
+	if (pg_fsync(fd) != 0)
+		elog(ERROR, "cannot fsync file \"%s\": %m", path);
+	CloseTransientFile(fd);
+
+	pfree(zeroes);
+
+	elog(LOG, "created undo segment \"%s\"", path); /* XXX: remove me */
+}
+
+/*
+ * Create a new undo segment, when it is unexpectedly not present.
+ */
+void
+UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno)
+{
+	Assert(InRecovery);
+	allocate_empty_undo_segment(logno, tablespace, segno * UndoLogSegmentSize);
+}
+
+/*
+ * Create and zero-fill the segment files needed to extend a given undo log
+ * up to 'new_end'.
+ */
+static void
+extend_undo_log(UndoLogNumber logno, UndoLogOffset new_end)
+{
+	UndoLogControl *log;
+	char		dir[MAXPGPATH];
+	size_t		end;
+
+	log = get_undo_log(logno, false);
+
+	/* TODO review interlocking */
+
+	Assert(log != NULL);
+	Assert(log->meta.end % UndoLogSegmentSize == 0);
+	Assert(new_end % UndoLogSegmentSize == 0);
+	Assert(MyUndoLogState.logs[log->meta.persistence] == log || InRecovery);
+
+	/*
+	 * Create all the segments needed to increase 'end' to the requested
+	 * size.  This is quite expensive, so we will try to avoid it completely
+	 * by renaming files into place in UndoLogDiscard instead.
+	 */
+	end = log->meta.end;
+	while (end < new_end)
+	{
+		allocate_empty_undo_segment(logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	Assert(end == new_end);
+
+	/*
+	 * Flush the parent dir so that the directory metadata survives a crash
+	 * after this point.
+	 */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/*
+	 * If we're not in recovery, we need to WAL-log the creation of the new
+	 * file(s).  We do that after the above filesystem modifications, in
+	 * violation of the data-before-WAL rule as exempted by
+	 * src/backend/access/transam/README.  This means that it's possible for
+	 * us to crash having made some or all of the filesystem changes but
+	 * before WAL logging, but in that case we'll eventually try to create the
+	 * same segment(s) again, which is tolerated.
+	 */
+	if (!InRecovery)
+	{
+		xl_undolog_extend xlrec;
+		XLogRecPtr	ptr;
+
+		xlrec.logno = logno;
+		xlrec.end = end;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_EXTEND);
+		XLogFlush(ptr);
+	}
+
+	/*
+	 * We didn't need to acquire the mutex to read 'end' above because only
+	 * we write to it.  But we need the mutex to update it, because the
+	 * checkpointer might read it concurrently.
+	 *
+	 * XXX It's possible for meta.end to be higher already during
+	 * recovery, because of the timing of a checkpoint; in that case we did
+	 * nothing above and we shouldn't update shmem here.  That interaction
+	 * needs more analysis.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (log->meta.end < end)
+		log->meta.end = end;
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Get an insertion point that is guaranteed to be backed by enough space to
+ * hold 'size' bytes of data.  To actually write into the undo log, client
+ * code should call this first and then use bufmgr routines to access buffers
+ * and provide WAL logs and redo handlers.  In other words, while this module
+ * looks after making sure the undo log has sufficient space and the undo meta
+ * data is crash safe, the *contents* of the undo log and (indirectly) the
+ * insertion point are the responsibility of client code.
+ *
+ * Return an undo log insertion point that can be converted to a buffer tag
+ * and an insertion point within a buffer page.
+ *
+ * XXX For now an xl_undolog_meta object is filled in, in case it turns out
+ * to be necessary to write it into the WAL record (like FPI, this must be
+ * logged once for each undo log after each checkpoint).  I think this should
+ * be moved out of this interface and done differently -- to review.
+ */
+UndoRecPtr
+UndoLogAllocate(size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = MyUndoLogState.logs[persistence];
+	UndoLogOffset new_insert;
+	TransactionId logxid;
+
+	/*
+	 * We may need to attach to an undo log, either because this is the first
+	 * time this backend has needed to write to an undo log at all or because
+	 * the undo_tablespaces GUC was changed.  When doing that, we'll need
+	 * interlocking against tablespaces being concurrently dropped.
+	 */
+
+ retry:
+	/* See if we need to check the undo_tablespaces GUC. */
+	if (unlikely(MyUndoLogState.need_to_choose_tablespace || log == NULL))
+	{
+		Oid		tablespace;
+		bool	need_to_unlock;
+
+		need_to_unlock =
+			choose_undo_tablespace(MyUndoLogState.need_to_choose_tablespace,
+								   &tablespace);
+		attach_undo_log(persistence, tablespace);
+		if (need_to_unlock)
+			LWLockRelease(TablespaceCreateLock);
+		log = MyUndoLogState.logs[persistence];
+		MyUndoLogState.need_to_choose_tablespace = false;
+	}
+
+	/*
+	 * If this is the first time we've allocated undo log space in this
+	 * transaction, we'll record the xid->undo log association so that it can
+	 * be replayed correctly. Before that, we set the first record flag to
+	 * false.
+	 */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.is_first_rec = false;
+	logxid = log->xid;
+
+	if (logxid != GetTopTransactionId())
+	{
+		xl_undolog_attach xlrec;
+
+		/*
+		 * While we have the lock, check if we have been forcibly detached by
+		 * DROP TABLESPACE.  That can only happen between transactions (see
+		 * DropUndoLogsInTablespace()).
+		 */
+		if (log->pid == InvalidPid)
+		{
+			LWLockRelease(&log->mutex);
+			log = NULL;
+			goto retry;
+		}
+		log->xid = GetTopTransactionId();
+		log->meta.is_first_rec = true;
+		LWLockRelease(&log->mutex);
+
+		/* Skip the attach record for unlogged and temporary tables. */
+		if (persistence == UNDO_PERMANENT)
+		{
+			xlrec.xid = GetTopTransactionId();
+			xlrec.logno = log->logno;
+			xlrec.dbid = MyDatabaseId;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_ATTACH);
+		}
+	}
+	else
+	{
+		LWLockRelease(&log->mutex);
+	}
+
+	/*
+	 * 'size' is expressed in usable non-header bytes.  Figure out how far we
+	 * have to move insert to create space for 'size' usable bytes, stepping
+	 * over any intervening headers.
+	 */
+	Assert(log->meta.insert % BLCKSZ >= UndoLogBlockHeaderSize);
+	new_insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	Assert(new_insert % BLCKSZ >= UndoLogBlockHeaderSize);
+
+	/*
+	 * We don't need to acquire log->mutex to read log->meta.insert and
+	 * log->meta.end, because this backend is the only one that can
+	 * modify them.
+	 */
+	if (unlikely(new_insert > log->meta.end))
+	{
+		if (new_insert > UndoLogMaxSize)
+		{
+			/* This undo log is entirely full.  Get a new one. */
+			elog(LOG, "undo log %u is full, switching to a new one", log->logno);
+			log = NULL;
+			detach_current_undo_log(persistence, true);
+			goto retry;
+		}
+		/*
+		 * Extend the end of this undo log to cover new_insert (in other words
+		 * round up to the segment size).
+		 */
+		extend_undo_log(log->logno,
+						new_insert + UndoLogSegmentSize -
+						new_insert % UndoLogSegmentSize);
+		Assert(new_insert <= log->meta.end);
+	}
+
+	return MakeUndoRecPtr(log->logno, log->meta.insert);
+}
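+
+/*
+ * Illustrative calling sequence (a sketch only, not part of this patch):
+ * client code obtains space with UndoLogAllocate(), writes and WAL-logs the
+ * data via the buffer manager, and then advances the insert pointer, e.g.
+ *
+ *     UndoRecPtr urp = UndoLogAllocate(size, UNDO_PERMANENT);
+ *     ... pin and dirty the buffers covering urp, copy in the payload,
+ *     ... and register it with a WAL record ...
+ *     UndoLogAdvance(urp, size, UNDO_PERMANENT);
+ */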
+
+/*
+ * In recovery, we expect the xid to map to a known log which already has
+ * enough space in it.
+ */
+UndoRecPtr
+UndoLogAllocateInRecovery(TransactionId xid, size_t size,
+						  UndoPersistence level)
+{
+	uint16		high_bits = UndoLogGetXidHigh(xid);
+	uint16		low_bits = UndoLogGetXidLow(xid);
+	UndoLogNumber logno;
+	UndoLogControl *log;
+
+	/*
+	 * The sequence of calls to UndoLogAllocateInRecovery() during REDO
+	 * (recovery) must match the sequence of calls to UndoLogAllocate() during
+	 * DO, for any given session.  The XXX_redo code for any UNDO-generating
+	 * operation must use UndoLogAllocateInRecovery() rather than
+	 * UndoLogAllocate(), because it must supply the extra 'xid' argument so
+	 * that we can find out which undo log number to use.  During DO, that's
+	 * tracked per-backend, but during REDO the original backends/sessions are
+	 * lost and we have only the Xids.
+	 */
+	Assert(InRecovery);
+
+	/*
+	 * Look up the undo log number for this xid.  The mapping must already
+	 * have been created by an XLOG_UNDOLOG_ATTACH record emitted during the
+	 * first call to UndoLogAllocate for this xid after the most recent
+	 * checkpoint.
+	 */
+	if (MyUndoLogState.xid_map == NULL)
+		elog(ERROR, "xid to undo log number map not initialized");
+	if (MyUndoLogState.xid_map[high_bits] == NULL)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+	logno = MyUndoLogState.xid_map[high_bits][low_bits];
+	if (logno == InvalidUndoLogNumber)
+		elog(ERROR, "cannot find undo log number for xid %u", xid);
+
+	/*
+	 * This log must already have been created by an XLOG_UNDOLOG_CREATE
+	 * record emitted by UndoLogAllocate().
+	 */
+	log = get_undo_log(logno, false);
+	if (log == NULL)
+		elog(ERROR, "cannot find undo log number %d for xid %u", logno, xid);
+
+	/*
+	 * This log must already have been extended to cover the requested size by
+	 * XLOG_UNDOLOG_EXTEND records emitted by UndoLogAllocate(), or by
+	 * XLOG_UNDOLOG_DISCARD records recycling segments.
+	 */
+	if (log->meta.end < UndoLogOffsetPlusUsableBytes(log->meta.insert, size))
+		elog(ERROR,
+			 "unexpectedly couldn't allocate %zu bytes in undo log number %d",
+			 size, logno);
+
+	/*
+	 * By this time we have allocated an undo log for this transaction, so the
+	 * next record will not be the first undo record for the transaction.
+	 */
+	log->meta.is_first_rec = false;
+
+	return MakeUndoRecPtr(logno, log->meta.insert);
+}
+
+/*
+ * Advance the insertion pointer by 'size' usable (non-header) bytes.
+ */
+void
+UndoLogAdvance(UndoRecPtr insertion_point, size_t size, UndoPersistence persistence)
+{
+	UndoLogControl *log = NULL;
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insertion_point) ;
+
+	/*
+	 * During recovery, MyUndoLogState is not initialized, so we have to look
+	 * the log up by number instead.
+	 */
+	log = (InRecovery) ? get_undo_log(logno, false)
+		: MyUndoLogState.logs[persistence];
+
+	Assert(log != NULL);
+	Assert(InRecovery || logno == log->logno);
+	Assert(UndoRecPtrGetOffset(insertion_point) == log->meta.insert);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = UndoLogOffsetPlusUsableBytes(log->meta.insert, size);
+	LWLockRelease(&log->mutex);
+}
+
+/*
+ * Advance the discard pointer in one undo log, discarding all undo data
+ * relating to one or more whole transactions.  The passed in undo pointer is
+ * the address of the oldest data that the caller would like to keep, and the
+ * affected undo log is implied by this pointer, ie
+ * UndoRecPtrGetLogNo(discard_pointer).
+ *
+ * The caller asserts that there will be no attempts to access the undo log
+ * region being discarded after this moment.  This operation will cause the
+ * relevant buffers to be dropped immediately, without writing any data out to
+ * disk.  Any attempt to read the buffers (except a partial buffer at the end
+ * of this range which will remain) may result in IO errors, because the
+ * underlying segment file may have been physically removed.
+ *
+ * Only one backend should call this for a given undo log concurrently, or
+ * data structures will become corrupted.  It is expected that the caller will
+ * be an undo worker; only one undo worker should be working on a given undo
+ * log at a time.
+ */
+void
+UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(discard_point);
+	UndoLogOffset discard = UndoRecPtrGetOffset(discard_point);
+	UndoLogOffset old_discard;
+	UndoLogOffset end;
+	UndoLogControl *log;
+	int			segno;
+	int			new_segno;
+	bool		need_to_flush_wal = false;
+	bool		entirely_discarded = false;
+
+	log = get_undo_log(logno, false);
+	if (unlikely(log == NULL))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is already entirely discarded",
+			 logno);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	if (unlikely(log->logno != logno))
+		elog(ERROR,
+			 "cannot advance discard pointer for undo log %d because it is entirely discarded",
+			 logno);
+	if (discard > log->meta.insert)
+		elog(ERROR, "cannot move discard point past insert point");
+	old_discard = log->meta.discard;
+	if (discard < old_discard)
+		elog(ERROR, "cannot move discard pointer backwards");
+	end = log->meta.end;
+	/* Are we discarding the last remaining data in a log marked as full? */
+	if (log->meta.status == UNDO_LOG_STATUS_FULL &&
+		discard == log->meta.insert)
+	{
+		/*
+		 * Adjust the discard and insert pointers so that the final segment is
+		 * deleted from disk, and remember not to recycle it.
+		 */
+		entirely_discarded = true;
+		log->meta.insert = log->meta.end;
+		discard = log->meta.end;
+	}
+	LWLockRelease(&log->mutex);
+
+	/*
+	 * Drop all buffers holding this undo data out of the buffer pool (except
+	 * the last one, if the new location is in the middle of it somewhere), so
+	 * that the contained data doesn't ever touch the disk.  The caller
+	 * promises that this data will not be needed again.  We have to drop the
+	 * buffers from the buffer pool before removing files, otherwise a
+	 * concurrent session might try to write the block to evict the buffer.
+	 */
+	forget_undo_buffers(logno, old_discard, discard, entirely_discarded);
+
+	/*
+	 * Check if we crossed a segment boundary and need to do some synchronous
+	 * filesystem operations.
+	 */
+	segno = old_discard / UndoLogSegmentSize;
+	new_segno = discard / UndoLogSegmentSize;
+	if (segno < new_segno)
+	{
+		int		recycle;
+		UndoLogOffset pointer;
+
+		/*
+		 * We always WAL-log discards, but we only need to flush the WAL if we
+		 * have performed a filesystem operation.
+		 */
+		need_to_flush_wal = true;
+
+		/*
+		 * XXX When we rename or unlink a file, it's possible that some
+		 * backend still has it open because it has recently read a page from
+		 * it.  smgr/undofile.c in any such backend will eventually close it,
+		 * because it considers that fd to belong to the file with the name
+		 * that we're unlinking or renaming and it doesn't like to keep more
+		 * than one open at a time.  No backend should ever try to read from
+		 * such a file descriptor; that is what it means when we say that the
+		 * caller of UndoLogDiscard() asserts that there will be no attempts
+		 * to access the discarded range of undo log.  In the case of a
+		 * rename, if a backend were to attempt to read undo data in the range
+		 * being discarded, it would read entirely the wrong data.
+		 */
+
+		/*
+		 * How many segments should we recycle (= rename from tail position to
+		 * head position)?  For now it's always 1 unless there is already a
+		 * spare one, but we could have an adaptive algorithm that recycles
+		 * multiple segments at a time and pays just one fsync().
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+		if ((log->meta.end - log->meta.insert) < UndoLogSegmentSize &&
+			log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+			recycle = 1;
+		else
+			recycle = 0;
+		LWLockRelease(&log->mutex);
+
+		/* Rewind to the start of the segment. */
+		pointer = segno * UndoLogSegmentSize;
+
+		while (pointer < new_segno * UndoLogSegmentSize)
+		{
+			char	discard_path[MAXPGPATH];
+
+			/*
+			 * Before removing the file, make sure that undofile_sync knows
+			 * that it might be missing.
+			 */
+			undofile_forgetsync(log->logno,
+								log->meta.tablespace,
+								pointer / UndoLogSegmentSize);
+
+			UndoLogSegmentPath(logno, pointer / UndoLogSegmentSize,
+							   log->meta.tablespace, discard_path);
+
+			/* Can we recycle the oldest segment? */
+			if (recycle > 0)
+			{
+				char	recycle_path[MAXPGPATH];
+
+				/*
+				 * End points one byte past the end of the current undo space,
+				 * ie to the first byte of the segment file we want to create.
+				 */
+				UndoLogSegmentPath(logno, end / UndoLogSegmentSize,
+								   log->meta.tablespace, recycle_path);
+				if (rename(discard_path, recycle_path) == 0)
+				{
+					elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+					end += UndoLogSegmentSize;
+					--recycle;
+				}
+				else
+				{
+					elog(ERROR, "could not rename \"%s\" to \"%s\": %m",
+						 discard_path, recycle_path);
+				}
+			}
+			else
+			{
+				if (unlink(discard_path) == 0)
+					elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+				else
+					elog(ERROR, "could not unlink \"%s\": %m", discard_path);
+			}
+			pointer += UndoLogSegmentSize;
+		}
+	}
+
+	/* WAL log the discard. */
+	{
+		xl_undolog_discard xlrec;
+		XLogRecPtr ptr;
+
+		xlrec.logno = logno;
+		xlrec.discard = discard;
+		xlrec.end = end;
+		xlrec.latestxid = xid;
+		xlrec.entirely_discarded = entirely_discarded;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		ptr = XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_DISCARD);
+
+		if (need_to_flush_wal)
+			XLogFlush(ptr);
+	}
+
+	/* Update shmem to show the new discard and end pointers. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * Return an UndoRecPtr to the oldest valid data in an undo log, or
+ * InvalidUndoRecPtr if it is empty.
+ */
+UndoRecPtr
+UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full)
+{
+	UndoRecPtr	result;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	if (log->meta.discard == log->meta.insert)
+		result = InvalidUndoRecPtr;
+	else
+		result = MakeUndoRecPtr(log->logno, log->meta.discard);
+	*full = log->meta.status == UNDO_LOG_STATUS_FULL;
+	LWLockRelease(&log->mutex);
+
+	return result;
+}
+
+/*
+ * Return the next insert location.  This also validates the input xid: if
+ * the latest insert point is not for the same transaction id, an invalid
+ * undo record pointer is returned.
+ */
+UndoRecPtr
+UndoLogGetNextInsertPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId	logxid;
+	UndoRecPtr	insert;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) && !TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert);
+}
+
+/*
+ * Get the address of the most recently inserted record.
+ */
+UndoRecPtr
+UndoLogGetLastRecordPtr(UndoLogNumber logno, TransactionId xid)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+	TransactionId logxid;
+	UndoRecPtr insert;
+	uint16 prevlen;
+
+	LWLockAcquire(&log->mutex, LW_SHARED);
+	insert = log->meta.insert;
+	logxid = log->xid;
+	prevlen = log->meta.prevlen;
+	LWLockRelease(&log->mutex);
+
+	if (TransactionIdIsValid(logxid) &&
+		TransactionIdIsValid(xid) &&
+		!TransactionIdEquals(logxid, xid))
+		return InvalidUndoRecPtr;
+
+	if (prevlen == 0)
+		return InvalidUndoRecPtr;
+
+	return MakeUndoRecPtr(logno, insert - prevlen);
+}
+
+/*
+ * Rewind the undo log insert position, and also set the prevlen in the
+ * meta-data.
+ */
+void
+UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen)
+{
+	UndoLogNumber	logno = UndoRecPtrGetLogNo(insert_urp);
+	UndoLogControl *log = get_undo_log(logno, false);
+	UndoLogOffset	insert = UndoRecPtrGetOffset(insert_urp);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.insert = insert;
+	log->meta.prevlen = prevlen;
+
+	/*
+	 * Force a WAL record to be written on the next undo allocation, so that
+	 * during recovery the undo insert location is consistent with normal
+	 * allocation.
+	 */
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	/* WAL log the rewind. */
+	{
+		xl_undolog_rewind xlrec;
+
+		xlrec.logno = logno;
+		xlrec.insert = insert;
+		xlrec.prevlen = prevlen;
+
+		XLogBeginInsert();
+		XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_REWIND);
+	}
+}
+
+/*
+ * Delete unreachable files under pg_undo.  Any files corresponding to LSN
+ * positions before the previous checkpoint are no longer needed.
+ */
+static void
+CleanUpUndoCheckPointFiles(XLogRecPtr checkPointRedo)
+{
+	DIR	   *dir;
+	struct dirent *de;
+	char	path[MAXPGPATH];
+	char	oldest_path[MAXPGPATH];
+
+	/*
+	 * If a base backup is in progress, we can't delete any checkpoint
+	 * snapshot files because one of them corresponds to the backup label but
+	 * there could be any number of checkpoints during the backup.
+	 */
+	if (BackupInProgress())
+		return;
+
+	/* Otherwise keep only those >= the previous checkpoint's redo point. */
+	snprintf(oldest_path, MAXPGPATH, "%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	dir = AllocateDir("pg_undo");
+	while ((de = ReadDir(dir, "pg_undo")) != NULL)
+	{
+		/*
+		 * Assume that fixed width uppercase hex strings sort the same way as
+		 * the values they represent, so we can use strcmp to identify undo
+		 * log snapshot files corresponding to checkpoints that we don't need
+		 * anymore.  This assumption holds for ASCII.
+		 */
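+		/* For example, "000000000A000000" sorts before "000000000B000000". */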
+		if (!(strlen(de->d_name) == UNDO_CHECKPOINT_FILENAME_LENGTH))
+			continue;
+
+		if (UndoCheckPointFilenamePrecedes(de->d_name, oldest_path))
+		{
+			snprintf(path, MAXPGPATH, "pg_undo/%s", de->d_name);
+			if (unlink(path) != 0)
+				elog(ERROR, "could not unlink file \"%s\": %m", path);
+		}
+	}
+	FreeDir(dir);
+}
+
+/*
+ * Write out the undo log meta data to the pg_undo directory.  The actual
+ * contents of undo logs is in shared buffers and therefore handled by
+ * CheckPointBuffers(), but here we record the table of undo logs and their
+ * properties.
+ */
+void
+CheckPointUndoLogs(XLogRecPtr checkPointRedo, XLogRecPtr priorCheckPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogMetaData *serialized = NULL;
+	size_t	serialized_size = 0;
+	char   *data;
+	char	path[MAXPGPATH];
+	int		num_logs;
+	int		fd;
+	int		i;
+	pg_crc32c crc;
+
+	/*
+	 * We acquire UndoLogLock to prevent any undo logs from being created or
+	 * discarded while we build a snapshot of them.  This isn't expected to
+	 * take long on a healthy system because the number of active logs should
+	 * be around the number of backends.  Holding this lock won't prevent
+	 * concurrent access to the undo log, except when segments need to be
+	 * added or removed.
+	 */
+	LWLockAcquire(UndoLogLock, LW_SHARED);
+
+	/*
+	 * Rather than doing the file IO while we hold locks, we'll copy the
+	 * meta-data into a palloc'd buffer.
+	 */
+	serialized_size = sizeof(UndoLogMetaData) * UndoLogNumSlots();
+	serialized = (UndoLogMetaData *) palloc0(serialized_size);
+
+	/* Scan through all slots looking for non-empty ones. */
+	num_logs = 0;
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		UndoLogControl *slot = &shared->logs[i];
+
+		/* Skip empty slots. */
+		if (slot->logno == InvalidUndoLogNumber)
+			continue;
+
+		/* Capture snapshot while holding each mutex. */
+		LWLockAcquire(&slot->mutex, LW_EXCLUSIVE);
+		serialized[num_logs++] = slot->meta;
+		slot->need_attach_wal_record = true; /* XXX: ?!? */
+		LWLockRelease(&slot->mutex);
+	}
+
+	LWLockRelease(UndoLogLock);
+
+	/* Dump into a file under pg_undo. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_WRITE);
+	fd = OpenTransientFile(path, O_RDWR | O_CREAT | PG_BINARY);
+	if (fd < 0)
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not create file \"%s\": %m", path)));
+
+	/* Compute header checksum. */
+	INIT_CRC32C(crc);
+	COMP_CRC32C(crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(crc, &num_logs, sizeof(num_logs));
+	FIN_CRC32C(crc);
+
+	/* Write out the number of active logs + crc. */
+	if ((write(fd, &shared->low_logno, sizeof(shared->low_logno)) != sizeof(shared->low_logno)) ||
+		(write(fd, &shared->next_logno, sizeof(shared->next_logno)) != sizeof(shared->next_logno)) ||
+		(write(fd, &num_logs, sizeof(num_logs)) != sizeof(num_logs)) ||
+		(write(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+	/* Write out the meta data for all active undo logs. */
+	data = (char *) serialized;
+	INIT_CRC32C(crc);
+	serialized_size = num_logs * sizeof(UndoLogMetaData);
+	while (serialized_size > 0)
+	{
+		ssize_t written;
+
+		written = write(fd, data, serialized_size);
+		if (written < 0)
+			ereport(ERROR,
+					(errcode_for_file_access(),
+					 errmsg("could not write to file \"%s\": %m", path)));
+		COMP_CRC32C(crc, data, written);
+		serialized_size -= written;
+		data += written;
+	}
+	FIN_CRC32C(crc);
+
+	if (write(fd, &crc, sizeof(crc)) != sizeof(crc))
+		ereport(ERROR,
+				(errcode_for_file_access(),
+				 errmsg("could not write to file \"%s\": %m", path)));
+
+
+	/* Flush file and directory entry. */
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_SYNC);
+	pg_fsync(fd);
+	CloseTransientFile(fd);
+	fsync_fname("pg_undo", true);
+	pgstat_report_wait_end();
+
+	if (serialized)
+		pfree(serialized);
+
+	CleanUpUndoCheckPointFiles(priorCheckPointRedo);
+	undolog_xid_map_gc();
+}
+
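+/*
+ * Read back the undo log meta-data snapshot written by CheckPointUndoLogs()
+ * for the checkpoint whose redo location is 'checkPointRedo', verifying its
+ * checksums and recreating the in-memory slots and free lists.
+ */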
+void
+StartupUndoLogs(XLogRecPtr checkPointRedo)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char	path[MAXPGPATH];
+	int		i;
+	int		fd;
+	int		nlogs;
+	pg_crc32c crc;
+	pg_crc32c new_crc;
+
+	/* If initdb is calling, there is no file to read yet. */
+	if (IsBootstrapProcessingMode())
+		return;
+
+	/* Open the pg_undo file corresponding to the given checkpoint. */
+	snprintf(path, MAXPGPATH, "pg_undo/%016" INT64_MODIFIER "X",
+			 checkPointRedo);
+	pgstat_report_wait_start(WAIT_EVENT_UNDO_CHECKPOINT_READ);
+	fd = OpenTransientFile(path, O_RDONLY | PG_BINARY);
+	if (fd < 0)
+		elog(ERROR, "cannot open undo checkpoint snapshot \"%s\": %m", path);
+
+	/* Read the active log number range. */
+	if ((read(fd, &shared->low_logno, sizeof(shared->low_logno))
+		 != sizeof(shared->low_logno)) ||
+		(read(fd, &shared->next_logno, sizeof(shared->next_logno))
+		 != sizeof(shared->next_logno)) ||
+		(read(fd, &nlogs, sizeof(nlogs)) != sizeof(nlogs)) ||
+		(read(fd, &crc, sizeof(crc)) != sizeof(crc)))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+
+	/* Verify the header checksum. */
+	INIT_CRC32C(new_crc);
+	COMP_CRC32C(new_crc, &shared->low_logno, sizeof(shared->low_logno));
+	COMP_CRC32C(new_crc, &shared->next_logno, sizeof(shared->next_logno));
+	COMP_CRC32C(new_crc, &nlogs, sizeof(nlogs));
+	FIN_CRC32C(new_crc);
+
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	/*
+	 * We'll acquire UndoLogLock just because allocate_undo_log() asserts we
+	 * hold it (we don't actually expect concurrent access yet).
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/* Initialize all the logs and set up the freelist. */
+	INIT_CRC32C(new_crc);
+	for (i = 0; i < nlogs; ++i)
+	{
+		ssize_t size;
+		UndoLogControl *log;
+
+		/*
+		 * Get a new slot to hold this UndoLogControl object.  If this
+		 * checkpoint was created on a system with a higher max_connections
+		 * setting, it's theoretically possible that we don't have enough
+		 * space and cannot start up.
+		 */
+		log = allocate_undo_log();
+		if (!log)
+			ereport(ERROR,
+					(errmsg("not enough undo log slots to recover from checkpoint: need at least %d, have %zu",
+							nlogs, UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections.")));
+
+		/* Read in the meta data for this undo log. */
+		if ((size = read(fd, &log->meta, sizeof(log->meta))) != sizeof(log->meta))
+			elog(ERROR, "short read of pg_undo meta data in file \"%s\": %m (got %zd, wanted %zu)",
+				 path, size, sizeof(log->meta));
+		COMP_CRC32C(new_crc, &log->meta, sizeof(log->meta));
+
+		/*
+		 * At normal start-up, or during recovery, all active undo logs start
+		 * out on the appropriate free list.
+		 */
+		log->logno = log->meta.logno;
+		log->pid = InvalidPid;
+		log->xid = InvalidTransactionId;
+		if (log->meta.status == UNDO_LOG_STATUS_ACTIVE)
+		{
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+		}
+	}
+	FIN_CRC32C(new_crc);
+
+	LWLockRelease(UndoLogLock);
+
+	/* Verify body checksum. */
+	if (read(fd, &crc, sizeof(crc)) != sizeof(crc))
+		elog(ERROR, "pg_undo file \"%s\" is corrupted", path);
+	if (crc != new_crc)
+		elog(ERROR,
+			 "pg_undo file \"%s\" has incorrect checksum", path);
+
+	CloseTransientFile(fd);
+	pgstat_report_wait_end();
+}
+
+/*
+ * Return a pointer to a newly allocated UndoLogControl object in shared
+ * memory, or return NULL if there are no free slots.  The caller should
+ * acquire the mutex and set up the object.
+ */
+static UndoLogControl *
+allocate_undo_log(void)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMeInMode(UndoLogLock, LW_EXCLUSIVE));
+
+	for (i = 0; i < UndoLogNumSlots(); ++i)
+	{
+		log = &shared->logs[i];
+		if (log->logno == InvalidUndoLogNumber)
+		{
+			memset(&log->meta, 0, sizeof(log->meta));
+			log->next_free = InvalidUndoLogNumber;
+			/* TODO: oldest_xid etc? */
+			return log;
+		}
+	}
+
+	return NULL;
+}
+
+/*
+ * Free an UndoLogControl object in shared memory, so that it can be reused.
+ */
+static void
+free_undo_log(UndoLogControl *log)
+{
+	/*
+	 * When removing an undo log from a slot in shared memory, we acquire
+	 * UndoLogLock and log->mutex, so that other code can hold either lock to
+	 * prevent the object from disappearing.
+	 */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno != InvalidUndoLogNumber);
+	log->logno = InvalidUndoLogNumber;
+	memset(&log->meta, 0, sizeof(log->meta));
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * Get the UndoLogControl object for a given log number.
+ *
+ * The caller may or may not already hold UndoLogLock, and should indicate
+ * this by passing 'locked'.  We'll acquire it in the slow path if necessary.
+ * Either way, the caller must deal with the possibility that the returned
+ * UndoLogControl object pointed to no longer contains the requested logno by
+ * the time it is accessed.
+ *
+ * To do that, one of the following approaches must be taken by the calling
+ * code:
+ *
+ * 1.  If it is known that the calling backend is attached to the log, then it
+ * can be assumed that the UndoLogControl slot still holds the same undo log
+ * number.  The UndoLogControl slot can only change with the cooperation of
+ * the undo log that is attached to it (it must first be marked as
+ * UNDO_LOG_STATUS_FULL, which happens when a backend detaches).  Calling
+ * code should probably assert that it is attached and the logno is as
+ * expected, however.
+ *
+ * 2.  Acquire log->mutex before accessing any members, and after doing so,
+ * check that the logno is as expected.  If it is not, the entire undo log
+ * must be assumed to be discarded and the caller must behave accordingly.
+ *
+ * Return NULL if the undo log has been entirely discarded.  It is an error to
+ * ask for undo logs that have never been created.
+ */
+static UndoLogControl *
+get_undo_log(UndoLogNumber logno, bool locked)
+{
+	UndoLogControl *result = NULL;
+	UndoLogTableEntry *entry;
+	bool	   found;
+
+	Assert(locked == LWLockHeldByMe(UndoLogLock));
+
+	/* First see if we already have it in our cache. */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		result = entry->control;
+	else
+	{
+		UndoLogSharedData *shared = MyUndoLogState.shared;
+		int		i;
+
+		/* Nope.  Linear search for the slot in shared memory. */
+		if (!locked)
+			LWLockAcquire(UndoLogLock, LW_SHARED);
+		for (i = 0; i < UndoLogNumSlots(); ++i)
+		{
+			if (shared->logs[i].logno == logno)
+			{
+				/* Found it. */
+
+				/*
+				 * TODO: Should this function be usable in a critical section?
+				 * Would it make sense to detect that we are in a critical
+				 * section and just return the pointer to the log without
+				 * updating the cache, to avoid any chance of allocating
+				 * memory?
+				 */
+
+				entry = undologtable_insert(undologtable_cache, logno, &found);
+				entry->number = logno;
+				entry->control = &shared->logs[i];
+				entry->tablespace = entry->control->meta.tablespace;
+				result = entry->control;
+				break;
+			}
+		}
+
+		/*
+		 * If we didn't find it, then it must already have been entirely
+		 * discarded.  We create a negative cache entry so that we can answer
+		 * this question quickly next time.
+		 *
+		 * TODO: We could track the lowest known undo log number, to reduce
+		 * the negative cache entry bloat.
+		 */
+		if (result == NULL)
+		{
+			/*
+			 * Sanity check: the caller should not be asking about undo logs
+			 * that have never existed.
+			 */
+			if (logno >= shared->next_logno)
+				elog(PANIC, "undo log %u hasn't been created yet", logno);
+			entry = undologtable_insert(undologtable_cache, logno, &found);
+			entry->number = logno;
+			entry->control = NULL;
+			entry->tablespace = 0;
+		}
+		if (!locked)
+			LWLockRelease(UndoLogLock);
+	}
+
+	return result;
+}
+
+/*
+ * Get a pointer to an UndoLogControl object corresponding to a given logno.
+ *
+ * In general, the caller must acquire the UndoLogControl's mutex to access
+ * the contents, and at that time must consider that the logno might have
+ * changed because the undo log it contained has been entirely discarded.
+ *
+ * If the calling backend is currently attached to the undo log, that is not
+ * possible, because logs can only reach UNDO_LOG_STATUS_DISCARDED after first
+ * reaching UNDO_LOG_STATUS_FULL, and that only happens while detaching.
+ */
+UndoLogControl *
+UndoLogGet(UndoLogNumber logno, bool missing_ok)
+{
+	UndoLogControl *log = get_undo_log(logno, false);
+
+	if (log == NULL && !missing_ok)
+		elog(ERROR, "unknown undo log number %d", logno);
+
+	return log;
+}
+
+/*
+ * Attach to an undo log, possibly creating or recycling one as required.
+ */
+static void
+attach_undo_log(UndoPersistence persistence, Oid tablespace)
+{
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log = NULL;
+	UndoLogNumber logno;
+	UndoLogNumber *place;
+
+	Assert(!InRecovery);
+	Assert(MyUndoLogState.logs[persistence] == NULL);
+
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+
+	/*
+	 * For now we have a simple linked list of unattached undo logs for each
+	 * persistence level.  We'll grovel through it to find something for the
+	 * tablespace you asked for.  If you're not using multiple tablespaces
+	 * it'll be able to pop one off the front.  We might need a hash table
+	 * keyed by tablespace if this simple scheme turns out to be too slow when
+	 * using many tablespaces and many undo logs, but that seems like an
+	 * unusual use case not worth optimizing for.
+	 */
+	place = &shared->free_lists[persistence];
+	while (*place != InvalidUndoLogNumber)
+	{
+		UndoLogControl *candidate = get_undo_log(*place, true);
+
+		/*
+		 * There should never be an undo log on the freelist that has been
+		 * entirely discarded, or hasn't been created yet.  The persistence
+		 * level should match the freelist.
+		 */
+		if (unlikely(candidate == NULL))
+			elog(ERROR,
+				 "corrupted undo log freelist, no such undo log %u", *place);
+		if (unlikely(candidate->meta.persistence != persistence))
+			elog(ERROR,
+				 "corrupted undo log freelist, undo log %u with persistence %d found on freelist %d",
+				 *place, candidate->meta.persistence, persistence);
+
+		if (candidate->meta.tablespace == tablespace)
+		{
+			logno = *place;
+			log = candidate;
+			*place = candidate->next_free;
+			break;
+		}
+		place = &candidate->next_free;
+	}
+
+	/*
+	 * All existing undo logs for this tablespace and persistence level are
+	 * busy, so we'll have to create a new one.
+	 */
+	if (log == NULL)
+	{
+		if (shared->next_logno > MaxUndoLogNumber)
+		{
+			/*
+			 * You've used up all 16 exabytes of undo log addressing space.
+			 * This is a difficult state to reach using only 16 exabytes of
+			 * WAL.
+			 */
+			elog(ERROR, "undo log address space exhausted");
+		}
+
+		/* Allocate a slot from the UndoLogControl pool. */
+		log = allocate_undo_log();
+		if (unlikely(!log))
+			ereport(ERROR,
+					(errmsg("could not create new undo log"),
+					 errdetail("The maximum number of active undo logs is %zu.",
+							   UndoLogNumSlots()),
+					 errhint("Consider increasing max_connections.")));
+		log->logno = logno = shared->next_logno;
+
+		/*
+		 * The insert and discard pointers start after the first block's
+		 * header.  XXX That means that insert is > end for a short time in a
+		 * newly created undo log.  Is there any problem with that?
+		 */
+		log->meta.insert = UndoLogBlockHeaderSize;
+		log->meta.discard = UndoLogBlockHeaderSize;
+
+		log->meta.logno = logno;
+		log->meta.tablespace = tablespace;
+		log->meta.persistence = persistence;
+		log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+
+		/* Move the high log number pointer past this one. */
+		++shared->next_logno;
+
+		/* WAL-log the creation of this new undo log. */
+		{
+			xl_undolog_create xlrec;
+
+			xlrec.logno = logno;
+			xlrec.tablespace = log->meta.tablespace;
+			xlrec.persistence = log->meta.persistence;
+
+			XLogBeginInsert();
+			XLogRegisterData((char *) &xlrec, sizeof(xlrec));
+			XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_CREATE);
+		}
+
+		/*
+		 * This undo log has no segments.  UndoLogAllocate will create the
+		 * first one on demand.
+		 */
+	}
+	LWLockRelease(UndoLogLock);
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->pid = MyProcPid;
+	log->xid = InvalidTransactionId;
+	log->need_attach_wal_record = true;
+	LWLockRelease(&log->mutex);
+
+	MyUndoLogState.logs[persistence] = log;
+}
+
+/*
+ * Free chunks of the xid/undo log map that relate to transactions that are no
+ * longer running.  This is run at each checkpoint.
+ */
+static void
+undolog_xid_map_gc(void)
+{
+	UndoLogNumber **xid_map = MyUndoLogState.xid_map;
+	TransactionId oldest_xid;
+	uint16 new_oldest_chunk;
+	uint16 oldest_chunk;
+
+	if (xid_map == NULL)
+		return;
+
+	/*
+	 * During crash recovery, it may not be possible to call GetOldestXmin()
+	 * yet because latestCompletedXid is invalid.
+	 */
+	if (!TransactionIdIsNormal(ShmemVariableCache->latestCompletedXid))
+		return;
+
+	oldest_xid = GetOldestXmin(NULL, PROCARRAY_FLAGS_DEFAULT);
+	new_oldest_chunk = UndoLogGetXidHigh(oldest_xid);
+	oldest_chunk = MyUndoLogState.xid_map_oldest_chunk;
+
+	while (oldest_chunk != new_oldest_chunk)
+	{
+		if (xid_map[oldest_chunk])
+		{
+			pfree(xid_map[oldest_chunk]);
+			xid_map[oldest_chunk] = NULL;
+		}
+		oldest_chunk = (oldest_chunk + 1) % (1 << UndoLogXidHighBits);
+	}
+	MyUndoLogState.xid_map_oldest_chunk = new_oldest_chunk;
+}
+
+/*
+ * Associate a xid with an undo log, during recovery.  In a primary server,
+ * this isn't necessary because backends know which undo log they're attached
+ * to.  During recovery, the natural association between backends and xids is
+ * lost, so we need to manage that explicitly.
+ */
+static void
+undolog_xid_map_add(TransactionId xid, UndoLogNumber logno)
+{
+	uint16		high_bits;
+	uint16		low_bits;
+
+	high_bits = UndoLogGetXidHigh(xid);
+	low_bits = UndoLogGetXidLow(xid);
+
+	if (unlikely(MyUndoLogState.xid_map == NULL))
+	{
+		/* First time through.  Create mapping array. */
+		MyUndoLogState.xid_map =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber *) *
+								   (1 << (32 - UndoLogXidLowBits)));
+		MyUndoLogState.xid_map_oldest_chunk = high_bits;
+	}
+
+	if (unlikely(MyUndoLogState.xid_map[high_bits] == NULL))
+	{
+		/* This bank of mappings doesn't exist yet.  Create it. */
+		MyUndoLogState.xid_map[high_bits] =
+			MemoryContextAllocZero(TopMemoryContext,
+								   sizeof(UndoLogNumber) *
+								   (1 << UndoLogXidLowBits));
+	}
+
+	/* Associate this xid with this undo log number. */
+	MyUndoLogState.xid_map[high_bits][low_bits] = logno;
+}
+
+/* check_hook: validate new undo_tablespaces */
+bool
+check_undo_tablespaces(char **newval, void **extra, GucSource source)
+{
+	char	   *rawname;
+	List	   *namelist;
+
+	/* Need a modifiable copy of string */
+	rawname = pstrdup(*newval);
+
+	/*
+	 * Parse string into list of identifiers, just to check for
+	 * well-formedness (unfortunately we can't validate the names in the
+	 * catalog yet).
+	 */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+	{
+		/* syntax error in name list */
+		GUC_check_errdetail("List syntax is invalid.");
+		pfree(rawname);
+		list_free(namelist);
+		return false;
+	}
+
+	/*
+	 * Make sure we aren't already in a transaction that has been assigned an
+	 * XID.  This ensures we don't detach from an undo log that we might have
+	 * started writing undo data into for this transaction.
+	 */
+	if (GetTopTransactionIdIfAny() != InvalidTransactionId)
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 (errmsg("undo_tablespaces cannot be changed while a transaction is in progress"))));
+	list_free(namelist);
+
+	return true;
+}
+
+/* assign_hook: do extra actions as needed */
+void
+assign_undo_tablespaces(const char *newval, void *extra)
+{
+	/*
+	 * This is normally called only when GetTopTransactionIdIfAny() ==
+	 * InvalidTransactionId (because you can't change undo_tablespaces in the
+	 * middle of a transaction that's been assigned an xid), but we can't
+	 * assert that because it's also called at the end of a transaction that's
+	 * rolling back, to reset the GUC if it was set inside the transaction.
+	 */
+
+	/* Tell UndoLogAllocate() to reexamine undo_tablespaces. */
+	MyUndoLogState.need_to_choose_tablespace = true;
+}
+
+static bool
+choose_undo_tablespace(bool force_detach, Oid *tablespace)
+{
+	char   *rawname;
+	List   *namelist;
+	bool	need_to_unlock;
+	int		length;
+	int		i;
+
+	/* We need a modifiable copy of string. */
+	rawname = pstrdup(undo_tablespaces);
+
+	/* Break string into list of identifiers. */
+	if (!SplitIdentifierString(rawname, ',', &namelist))
+		elog(ERROR, "undo_tablespaces is unexpectedly malformed");
+
+	length = list_length(namelist);
+	if (length == 0 ||
+		(length == 1 && ((char *) linitial(namelist))[0] == '\0'))
+	{
+		/*
+		 * If it's an empty string, then we'll use the default tablespace.  No
+		 * locking is required because it can't be dropped.
+		 */
+		*tablespace = DEFAULTTABLESPACE_OID;
+		need_to_unlock = false;
+	}
+	else
+	{
+		/*
+		 * Choose an OID using our pid, so that if several backends have the
+		 * same multi-tablespace setting they'll spread out.  We could easily
+		 * do better than this if more serious load balancing is judged
+		 * useful.
+		 */
+		int		index = MyProcPid % length;
+		int		first_index = index;
+		Oid		oid = InvalidOid;
+
+		/*
+		 * Take the tablespace create/drop lock while we look the name up.
+		 * This prevents the tablespace from being dropped while we're trying
+		 * to resolve the name, or while the caller is trying to create an
+		 * undo log in it.  The caller will have to release this lock.
+		 */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		for (;;)
+		{
+			const char *name = list_nth(namelist, index);
+
+			oid = get_tablespace_oid(name, true);
+			if (oid == InvalidOid)
+			{
+				/* Unknown tablespace, try the next one. */
+				index = (index + 1) % length;
+				/*
+				 * But if we've tried them all, it's time to complain.  We'll
+				 * arbitrarily complain about the last one we tried in the
+				 * error message.
+				 */
+				if (index == first_index)
+					ereport(ERROR,
+							(errcode(ERRCODE_UNDEFINED_OBJECT),
+							 errmsg("tablespace \"%s\" does not exist", name),
+							 errhint("Create the tablespace or set undo_tablespaces to a valid or empty list.")));
+				continue;
+			}
+			if (oid == GLOBALTABLESPACE_OID)
+				ereport(ERROR,
+						(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+						 errmsg("undo logs cannot be placed in pg_global tablespace")));
+			/* If we got here we succeeded in finding one. */
+			break;
+		}
+
+		Assert(oid != InvalidOid);
+		*tablespace = oid;
+		need_to_unlock = true;
+	}
+
+	/*
+	 * If we came here because the user changed undo_tablespaces, then detach
+	 * from any undo logs we happen to be attached to.
+	 */
+	if (force_detach)
+	{
+		for (i = 0; i < UndoPersistenceLevels; ++i)
+		{
+			UndoLogControl *log = MyUndoLogState.logs[i];
+			UndoLogSharedData *shared = MyUndoLogState.shared;
+
+			if (log != NULL)
+			{
+				LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+				log->pid = InvalidPid;
+				log->xid = InvalidTransactionId;
+				LWLockRelease(&log->mutex);
+
+				LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+				log->next_free = shared->free_lists[i];
+				shared->free_lists[i] = log->logno;
+				LWLockRelease(UndoLogLock);
+
+				MyUndoLogState.logs[i] = NULL;
+			}
+		}
+	}
+
+	return need_to_unlock;
+}
+
+bool
+DropUndoLogsInTablespace(Oid tablespace)
+{
+	DIR *dir;
+	char undo_path[MAXPGPATH];
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	UndoLogControl *log;
+	int		i;
+
+	Assert(LWLockHeldByMe(TablespaceCreateLock));
+	Assert(tablespace != DEFAULTTABLESPACE_OID);
+
+	/* First, try to kick everyone off any undo logs in this tablespace. */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		bool ok;
+		bool return_to_freelist = false;
+
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/* Check if this undo log can be forcibly detached. */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		if (log->meta.discard == log->meta.insert &&
+			(log->xid == InvalidTransactionId ||
+			 !TransactionIdIsInProgress(log->xid)))
+		{
+			log->xid = InvalidTransactionId;
+			if (log->pid != InvalidPid)
+			{
+				log->pid = InvalidPid;
+				return_to_freelist = true;
+			}
+			ok = true;
+		}
+		else
+		{
+			/*
+			 * There is data we need in this undo log.  We can't force it to
+			 * be detached.
+			 */
+			ok = false;
+		}
+		LWLockRelease(&log->mutex);
+
+		/* If we failed, then give up now and report failure. */
+		if (!ok)
+			return false;
+
+		/*
+		 * Put this undo log back on the appropriate free-list.  No one can
+		 * attach to it while we hold TablespaceCreateLock, but if we return
+		 * early during a later iteration of this loop, we need the undo log
+		 * to remain usable.  We'll remove all appropriate logs from the
+		 * free-lists in a separate step below.
+		 */
+		if (return_to_freelist)
+		{
+			LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+			log->next_free = shared->free_lists[log->meta.persistence];
+			shared->free_lists[log->meta.persistence] = log->logno;
+			LWLockRelease(UndoLogLock);
+		}
+	}
+
+	/*
+	 * We detached all backends from undo logs in this tablespace, and no one
+	 * can attach to any non-default-tablespace undo logs while we hold
+	 * TablespaceCreateLock.  We can now drop the undo logs.
+	 */
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		/* Skip undo logs in other tablespaces. */
+		if (log->meta.tablespace != tablespace)
+			continue;
+
+		/*
+		 * Make sure no buffers remain.  When that is done by UndoDiscard(),
+		 * the final page is left in shared_buffers because it may contain
+		 * data, or at least be needed again very soon.  Here we need to drop
+		 * even that page from the buffer pool.
+		 */
+		forget_undo_buffers(log->logno, log->meta.discard, log->meta.discard, true);
+
+		/*
+		 * TODO: For now we drop the undo log, meaning that it will never be
+		 * used again.  That wastes the rest of its address space.  Instead,
+		 * we should put it onto a special list of 'offline' undo logs, ready
+		 * to be reactivated in some other tablespace.  Then we can keep the
+		 * unused portion of its address space.
+		 */
+		LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+		log->meta.status = UNDO_LOG_STATUS_DISCARDED;
+		LWLockRelease(&log->mutex);
+	}
+
+	/* Unlink all undo segment files in this tablespace. */
+	UndoLogDirectory(tablespace, undo_path);
+
+	dir = AllocateDir(undo_path);
+	if (dir != NULL)
+	{
+		struct dirent *de;
+
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strcmp(de->d_name, ".") == 0 ||
+				strcmp(de->d_name, "..") == 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			if (unlink(segment_path) < 0)
+				elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+	}
+
+	/* Remove all dropped undo logs from the free-lists. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+	{
+		UndoLogControl *log;
+		UndoLogNumber *place;
+
+		place = &shared->free_lists[i];
+		while (*place != InvalidUndoLogNumber)
+		{
+			log = get_undo_log(*place, true);
+			if (!log)
+				elog(ERROR,
+					 "corrupted undo log freelist, unknown log %u", *place);
+			if (log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+				*place = log->next_free;
+			else
+				place = &log->next_free;
+		}
+	}
+	LWLockRelease(UndoLogLock);
+
+	return true;
+}
+
+void
+ResetUndoLogs(UndoPersistence persistence)
+{
+	UndoLogControl *log;
+
+	for (log = UndoLogNext(NULL); log != NULL; log = UndoLogNext(log))
+	{
+		DIR	   *dir;
+		struct dirent *de;
+		char	undo_path[MAXPGPATH];
+		char	segment_prefix[MAXPGPATH];
+		size_t	segment_prefix_size;
+
+		if (log->meta.persistence != persistence)
+			continue;
+
+		/* Scan the directory for files belonging to this undo log. */
+		snprintf(segment_prefix, sizeof(segment_prefix), "%06X.", log->logno);
+		segment_prefix_size = strlen(segment_prefix);
+		UndoLogDirectory(log->meta.tablespace, undo_path);
+		dir = AllocateDir(undo_path);
+		if (dir == NULL)
+			continue;
+		while ((de = ReadDirExtended(dir, undo_path, LOG)) != NULL)
+		{
+			char segment_path[MAXPGPATH];
+
+			if (strncmp(de->d_name, segment_prefix, segment_prefix_size) != 0)
+				continue;
+			snprintf(segment_path, sizeof(segment_path), "%s/%s",
+					 undo_path, de->d_name);
+			elog(LOG, "unlinked undo segment \"%s\"", segment_path); /* XXX: remove me */
+			if (unlink(segment_path) < 0)
+				elog(LOG, "couldn't unlink file \"%s\": %m", segment_path);
+		}
+		FreeDir(dir);
+
+		/*
+		 * We have no segment files.  Set the pointers to indicate that there
+		 * is no data.  The discard and insert pointers point to the first
+		 * usable byte in the segment we will create when we next try to
+		 * allocate.  This is a bit strange, because it means that they are
+		 * past the end pointer.  That's the same as when new undo logs are
+		 * created.
+		 *
+		 * TODO: Should we rewind to zero instead, so we can reuse that (now)
+		 * unreferenced address space?
+		 */
+		log->meta.insert = log->meta.discard = log->meta.end +
+			UndoLogBlockHeaderSize;
+	}
+}
+
+Datum
+pg_stat_get_undo_logs(PG_FUNCTION_ARGS)
+{
+#define PG_STAT_GET_UNDO_LOGS_COLS 10
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	TupleDesc	tupdesc;
+	Tuplestorestate *tupstore;
+	MemoryContext per_query_ctx;
+	MemoryContext oldcontext;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+	char *tablespace_name = NULL;
+	Oid last_tablespace = InvalidOid;
+	int			i;
+
+	/* check to see if caller supports us returning a tuplestore */
+	if (rsinfo == NULL || !IsA(rsinfo, ReturnSetInfo))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("set-valued function called in context that cannot accept a set")));
+	if (!(rsinfo->allowedModes & SFRM_Materialize))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg("materialize mode required, but it is not " \
+						"allowed in this context")));
+
+	/* Build a tuple descriptor for our result type */
+	if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
+		elog(ERROR, "return type must be a row type");
+
+	per_query_ctx = rsinfo->econtext->ecxt_per_query_memory;
+	oldcontext = MemoryContextSwitchTo(per_query_ctx);
+
+	tupstore = tuplestore_begin_heap(true, false, work_mem);
+	rsinfo->returnMode = SFRM_Materialize;
+	rsinfo->setResult = tupstore;
+	rsinfo->setDesc = tupdesc;
+
+	MemoryContextSwitchTo(oldcontext);
+
+	/* Scan all undo logs to build the results. */
+	for (i = 0; i < shared->array_size; ++i)
+	{
+		UndoLogControl *log = &shared->logs[i];
+		char buffer[17];
+		Datum values[PG_STAT_GET_UNDO_LOGS_COLS];
+		bool nulls[PG_STAT_GET_UNDO_LOGS_COLS] = { false };
+		Oid tablespace;
+
+		if (log == NULL)
+			continue;
+
+		/*
+		 * This won't be a consistent result overall, but the values for each
+		 * log will be consistent because we'll take the per-log lock while
+		 * copying them.
+		 */
+		LWLockAcquire(&log->mutex, LW_SHARED);
+
+		/* Skip unused slots and entirely discarded undo logs. */
+		if (log->logno == InvalidUndoLogNumber ||
+			log->meta.status == UNDO_LOG_STATUS_DISCARDED)
+		{
+			LWLockRelease(&log->mutex);
+			continue;
+		}
+
+		values[0] = ObjectIdGetDatum((Oid) log->logno);
+		values[1] = CStringGetTextDatum(
+			log->meta.persistence == UNDO_PERMANENT ? "permanent" :
+			log->meta.persistence == UNDO_UNLOGGED ? "unlogged" :
+			log->meta.persistence == UNDO_TEMP ? "temporary" : "<unknown>");
+		tablespace = log->meta.tablespace;
+
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.discard));
+		values[3] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.insert));
+		values[4] = CStringGetTextDatum(buffer);
+		snprintf(buffer, sizeof(buffer), UndoRecPtrFormat,
+				 MakeUndoRecPtr(log->logno, log->meta.end));
+		values[5] = CStringGetTextDatum(buffer);
+		if (log->xid == InvalidTransactionId)
+			nulls[6] = true;
+		else
+			values[6] = TransactionIdGetDatum(log->xid);
+		if (log->pid == InvalidPid)
+			nulls[7] = true;
+		else
+			values[7] = Int32GetDatum((int64) log->pid);
+		/* prev_logno is not populated by this function. */
+		nulls[8] = true;
+		switch (log->meta.status)
+		{
+		case UNDO_LOG_STATUS_ACTIVE:
+			values[9] = CStringGetTextDatum("ACTIVE"); break;
+		case UNDO_LOG_STATUS_FULL:
+			values[9] = CStringGetTextDatum("FULL"); break;
+		default:
+			nulls[9] = true;
+		}
+		LWLockRelease(&log->mutex);
+
+		/*
+		 * Deal with potentially slow tablespace name lookup without the lock.
+		 * Avoid making multiple calls to that expensive function for the
+		 * common case of a repeated tablespace.
+		 */
+		if (tablespace != last_tablespace)
+		{
+			if (tablespace_name)
+				pfree(tablespace_name);
+			tablespace_name = get_tablespace_name(tablespace);
+			last_tablespace = tablespace;
+		}
+		if (tablespace_name)
+		{
+			values[2] = CStringGetTextDatum(tablespace_name);
+			nulls[2] = false;
+		}
+		else
+			nulls[2] = true;
+
+		tuplestore_putvalues(tupstore, tupdesc, values, nulls);
+	}
+
+	if (tablespace_name)
+		pfree(tablespace_name);
+	tuplestore_donestoring(tupstore);
+
+	return (Datum) 0;
+}
+
+/*
+ * replay the creation of a new undo log
+ */
+static void
+undolog_xlog_create(XLogReaderState *record)
+{
+	xl_undolog_create *xlrec = (xl_undolog_create *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogSharedData *shared = MyUndoLogState.shared;
+
+	/* Create meta-data space in shared memory. */
+	LWLockAcquire(UndoLogLock, LW_EXCLUSIVE);
+	/* TODO: assert that it doesn't exist already? */
+	log = allocate_undo_log();
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->logno = xlrec->logno;
+	log->meta.logno = xlrec->logno;
+	log->meta.status = UNDO_LOG_STATUS_ACTIVE;
+	log->meta.persistence = xlrec->persistence;
+	log->meta.tablespace = xlrec->tablespace;
+	log->meta.insert = UndoLogBlockHeaderSize;
+	log->meta.discard = UndoLogBlockHeaderSize;
+	shared->next_logno = Max(xlrec->logno + 1, shared->next_logno);
+	LWLockRelease(&log->mutex);
+	LWLockRelease(UndoLogLock);
+}
+
+/*
+ * replay the addition of a new segment to an undo log
+ */
+static void
+undolog_xlog_extend(XLogReaderState *record)
+{
+	xl_undolog_extend *xlrec = (xl_undolog_extend *) XLogRecGetData(record);
+
+	/* Extend exactly as we would during DO phase. */
+	extend_undo_log(xlrec->logno, xlrec->end);
+}
+
+/*
+ * replay the association of an xid with a specific undo log
+ */
+static void
+undolog_xlog_attach(XLogReaderState *record)
+{
+	xl_undolog_attach *xlrec = (xl_undolog_attach *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	undolog_xid_map_add(xlrec->xid, xlrec->logno);
+
+	/* Restore current dbid */
+	MyUndoLogState.dbid = xlrec->dbid;
+
+	/*
+	 * Whatever follows is the first record for this transaction.  Zheap will
+	 * use this to add UREC_INFO_TRANSACTION.
+	 */
+	log = get_undo_log(xlrec->logno, false);
+	/* TODO */
+	log->meta.is_first_rec = true;
+	log->xid = xlrec->xid;
+}
+
+/*
+ * replay the undo-log switch wal.  Store the transaction's undo record
+ * pointer of the previous log in MyUndoLogState temporarily, which will
+ * be reset after reading first time.
+ */
+static void
+undolog_xlog_switch(XLogReaderState *record)
+{
+	UndoRecPtr prevlogurp = *((UndoRecPtr *) XLogRecGetData(record));
+
+	MyUndoLogState.prevlogurp = prevlogurp;
+}
+
+/*
+ * Drop all buffers for the given undo log, from old_discard up to
+ * new_discard.  If drop_tail is true, also drop the buffer that holds
+ * new_discard; this is used when discarding undo logs completely, for
+ * example via DROP TABLESPACE.  If it is false, the final buffer is not
+ * dropped because it may contain data.
+ */
+static void
+forget_undo_buffers(int logno, UndoLogOffset old_discard,
+					UndoLogOffset new_discard, bool drop_tail)
+{
+	BlockNumber old_blockno;
+	BlockNumber new_blockno;
+	RelFileNode	rnode;
+
+	UndoRecPtrAssignRelFileNode(rnode, MakeUndoRecPtr(logno, old_discard));
+	old_blockno = old_discard / BLCKSZ;
+	new_blockno = new_discard / BLCKSZ;
+	if (drop_tail)
+		++new_blockno;
+	while (old_blockno < new_blockno)
+		ForgetBuffer(rnode, UndoLogForkNum, old_blockno++);
+}
+
+/*
+ * replay an undo segment discard record
+ */
+static void
+undolog_xlog_discard(XLogReaderState *record)
+{
+	xl_undolog_discard *xlrec = (xl_undolog_discard *) XLogRecGetData(record);
+	UndoLogControl *log;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	UndoLogOffset old_segment_begin;
+	UndoLogOffset new_segment_begin;
+	RelFileNode rnode = {0};
+	char	dir[MAXPGPATH];
+
+	log = get_undo_log(xlrec->logno, false);
+	if (log == NULL)
+		elog(ERROR, "unknown undo log %d", xlrec->logno);
+
+	/*
+	 * We're about to discard undo logs.  In Hot Standby mode, ensure that
+	 * there are no queries running that need to fetch tuples from the
+	 * discarded undo.
+	 *
+	 * XXX we are passing an empty rnode to the conflict function so that it
+	 * checks for conflicts in all backends, regardless of which database
+	 * each backend is connected to.
+	 */
+	if (InHotStandby && TransactionIdIsValid(xlrec->latestxid))
+		ResolveRecoveryConflictWithSnapshot(xlrec->latestxid, rnode);
+
+	/*
+	 * See if we need to unlink or rename any files, but don't consider it an
+	 * error if we find that files are missing.  Since UndoLogDiscard()
+	 * performs filesystem operations before WAL logging or updating shmem
+	 * which could be checkpointed, a crash could have left files already
+	 * deleted, but we could replay WAL that expects the files to be there.
+	 */
+
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	Assert(log->logno == xlrec->logno);
+	discard = log->meta.discard;
+	end = log->meta.end;
+	LWLockRelease(&log->mutex);
+
+	/* Drop buffers before we remove/recycle any files. */
+	forget_undo_buffers(xlrec->logno, discard, xlrec->discard,
+						xlrec->entirely_discarded);
+
+	/* Rewind to the start of the segment. */
+	old_segment_begin = discard - discard % UndoLogSegmentSize;
+	new_segment_begin = xlrec->discard - xlrec->discard % UndoLogSegmentSize;
+
+	/* Unlink or rename segments that are no longer in range. */
+	while (old_segment_begin < new_segment_begin)
+	{
+		char	discard_path[MAXPGPATH];
+
+		/*
+		 * Before removing the file, make sure that undofile_sync knows that
+		 * it might be missing.
+		 */
+		undofile_forgetsync(log->logno,
+							log->meta.tablespace,
+							old_segment_begin / UndoLogSegmentSize);
+
+		UndoLogSegmentPath(xlrec->logno, old_segment_begin / UndoLogSegmentSize,
+						   log->meta.tablespace, discard_path);
+
+		/* Can we recycle the oldest segment? */
+		if (end < xlrec->end)
+		{
+			char	recycle_path[MAXPGPATH];
+
+			UndoLogSegmentPath(xlrec->logno, end / UndoLogSegmentSize,
+							   log->meta.tablespace, recycle_path);
+			if (rename(discard_path, recycle_path) == 0)
+			{
+				elog(LOG, "recycled undo segment \"%s\" -> \"%s\"", discard_path, recycle_path); /* XXX: remove me */
+				end += UndoLogSegmentSize;
+			}
+			else
+			{
+				elog(LOG, "could not rename \"%s\" to \"%s\": %m",
+					 discard_path, recycle_path);
+			}
+		}
+		else
+		{
+			if (unlink(discard_path) == 0)
+				elog(LOG, "unlinked undo segment \"%s\"", discard_path); /* XXX: remove me */
+			else
+				elog(LOG, "could not unlink \"%s\": %m", discard_path);
+		}
+		old_segment_begin += UndoLogSegmentSize;
+	}
+
+	/* Create any further new segments that are needed the slow way. */
+	while (end < xlrec->end)
+	{
+		allocate_empty_undo_segment(xlrec->logno, log->meta.tablespace, end);
+		end += UndoLogSegmentSize;
+	}
+
+	/* Flush the directory entries. */
+	UndoLogDirectory(log->meta.tablespace, dir);
+	fsync_fname(dir, true);
+
+	/* Update shmem. */
+	LWLockAcquire(&log->mutex, LW_EXCLUSIVE);
+	log->meta.discard = xlrec->discard;
+	log->meta.end = end;
+	LWLockRelease(&log->mutex);
+
+	/* If we discarded everything, the slot can be given up. */
+	if (xlrec->entirely_discarded)
+		free_undo_log(log);
+}
+
+/*
+ * replay the rewind of an undo log
+ */
+static void
+undolog_xlog_rewind(XLogReaderState *record)
+{
+	xl_undolog_rewind *xlrec = (xl_undolog_rewind *) XLogRecGetData(record);
+	UndoLogControl *log;
+
+	log = get_undo_log(xlrec->logno, false);
+	log->meta.insert = xlrec->insert;
+	log->meta.prevlen = xlrec->prevlen;
+}
+
+void
+undolog_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	switch (info)
+	{
+		case XLOG_UNDOLOG_CREATE:
+			undolog_xlog_create(record);
+			break;
+		case XLOG_UNDOLOG_EXTEND:
+			undolog_xlog_extend(record);
+			break;
+		case XLOG_UNDOLOG_ATTACH:
+			undolog_xlog_attach(record);
+			break;
+		case XLOG_UNDOLOG_DISCARD:
+			undolog_xlog_discard(record);
+			break;
+		case XLOG_UNDOLOG_REWIND:
+			undolog_xlog_rewind(record);
+			break;
+		case XLOG_UNDOLOG_SWITCH:
+			undolog_xlog_switch(record);
+			break;
+		default:
+			elog(PANIC, "undo_redo: unknown op code %u", info);
+	}
+}
+
+/*
+ * For assertions only.
+ */
+bool
+AmAttachedToUndoLog(UndoLogControl *log)
+{
+	/*
+	 * In general, we can't access log's members without locking.  But this
+	 * function is intended only for asserting that you are attached, and
+	 * while you're attached the slot can't be recycled, so don't bother
+	 * locking.
+	 */
+	return MyUndoLogState.logs[log->meta.persistence] == log;
+}
+
+/*
+ * For testing use only.  This function is only used by the test_undo module.
+ */
+void
+UndoLogDetachFull(void)
+{
+	int		i;
+
+	for (i = 0; i < UndoPersistenceLevels; ++i)
+		if (MyUndoLogState.logs[i])
+			detach_current_undo_log(i, true);
+}
+
+/*
+ * Fetch database id from the undo log state
+ */
+Oid
+UndoLogStateGetDatabaseId()
+{
+	Assert(InRecovery);
+	return MyUndoLogState.dbid;
+}
+
+/*
+ * Get the location of the transaction's start header in the previous log.
+ *
+ * This should only be called during recovery.  The value of prevlogurp
+ * is restored in MyUndoLogState while replaying the UNDOLOG_XLOG_SWITCH
+ * WAL record, and it is cleared by this function.
+ */
+UndoRecPtr
+UndoLogStateGetAndClearPrevLogXactUrp()
+{
+	UndoRecPtr	prevlogurp;
+
+	Assert(InRecovery);
+	prevlogurp = MyUndoLogState.prevlogurp;
+	MyUndoLogState.prevlogurp = InvalidUndoRecPtr;
+
+	return prevlogurp;
+}
+
+/*
+ * Get the undo log number my backend is attached to
+ */
+UndoLogNumber
+UndoLogAmAttachedTo(UndoPersistence persistence)
+{
+	if (MyUndoLogState.logs[persistence] == NULL)
+		return InvalidUndoLogNumber;
+	return MyUndoLogState.logs[persistence]->logno;
+}
+
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index f4d9e9d..c1e4409 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -940,6 +940,10 @@ GRANT SELECT (subdbid, subname, subowner, subenabled, subslotname, subpublicatio
     ON pg_subscription TO public;
 
 
+CREATE VIEW pg_stat_undo_logs AS
+    SELECT *
+    FROM pg_stat_get_undo_logs();
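+
+-- Example usage (illustrative): check how far each undo log has been
+-- discarded and which backend, if any, is currently attached:
+--   SELECT logno, tablespace, discard, xid, pid, status FROM pg_stat_undo_logs;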
+
 --
 -- We have a few function definitions in here, too.
 -- At some point there might be enough to justify breaking them out into
diff --git a/src/backend/commands/tablespace.c b/src/backend/commands/tablespace.c
index 946e1b9..9739c43 100644
--- a/src/backend/commands/tablespace.c
+++ b/src/backend/commands/tablespace.c
@@ -54,6 +54,7 @@
 #include "access/reloptions.h"
 #include "access/htup_details.h"
 #include "access/sysattr.h"
+#include "access/undolog.h"
 #include "access/xact.h"
 #include "access/xlog.h"
 #include "access/xloginsert.h"
@@ -488,6 +489,20 @@ DropTableSpace(DropTableSpaceStmt *stmt)
 	LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
 
 	/*
+	 * Drop the undo logs in this tablespace.  This will fail (without
+	 * dropping anything) if there are undo logs that we can't afford to drop
+	 * because they contain non-discarded data or a transaction is in
+	 * progress.  Since we hold TablespaceCreateLock, no other session will be
+	 * able to attach to an undo log in this tablespace (or any tablespace
+	 * except default) concurrently.
+	 */
+	if (!DropUndoLogsInTablespace(tablespaceoid))
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("tablespace \"%s\" cannot be dropped because it contains non-empty undo logs",
+						tablespacename)));
+
+	/*
 	 * Try to remove the physical infrastructure.
 	 */
 	if (!destroy_tablespace_directories(tablespaceoid, false))
@@ -1487,6 +1502,14 @@ tblspc_redo(XLogReaderState *record)
 	{
 		xl_tblspc_drop_rec *xlrec = (xl_tblspc_drop_rec *) XLogRecGetData(record);
 
+		/* This shouldn't be able to fail in recovery. */
+		LWLockAcquire(TablespaceCreateLock, LW_EXCLUSIVE);
+		if (!DropUndoLogsInTablespace(xlrec->ts_id))
+			ereport(ERROR,
+					(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+					 errmsg("tablespace cannot be dropped because it contains non-empty undo logs")));
+		LWLockRelease(TablespaceCreateLock);
+
 		/*
 		 * If we issued a WAL record for a drop tablespace it implies that
 		 * there were no files in it at all when the DROP was done. That means
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index eec3a22..a8aa11a 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -154,6 +154,7 @@ LogicalDecodingProcessRecord(LogicalDecodingContext *ctx, XLogReaderState *recor
 		case RM_COMMIT_TS_ID:
 		case RM_REPLORIGIN_ID:
 		case RM_GENERIC_ID:
+		case RM_UNDOLOG_ID:
 			/* just deal with xid, and done */
 			ReorderBufferProcessXid(ctx->reorder, XLogRecGetXid(record),
 									buf.origptr);
diff --git a/src/backend/storage/ipc/ipci.c b/src/backend/storage/ipc/ipci.c
index 2849e47..faedafb 100644
--- a/src/backend/storage/ipc/ipci.c
+++ b/src/backend/storage/ipc/ipci.c
@@ -21,6 +21,7 @@
 #include "access/nbtree.h"
 #include "access/subtrans.h"
 #include "access/twophase.h"
+#include "access/undolog.h"
 #include "commands/async.h"
 #include "miscadmin.h"
 #include "pgstat.h"
@@ -126,6 +127,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 		size = add_size(size, ProcGlobalShmemSize());
 		size = add_size(size, XLOGShmemSize());
 		size = add_size(size, CLOGShmemSize());
+		size = add_size(size, UndoLogShmemSize());
 		size = add_size(size, CommitTsShmemSize());
 		size = add_size(size, SUBTRANSShmemSize());
 		size = add_size(size, TwoPhaseShmemSize());
@@ -217,6 +219,7 @@ CreateSharedMemoryAndSemaphores(bool makePrivate, int port)
 	 */
 	XLOGShmemInit();
 	CLOGShmemInit();
+	UndoLogShmemInit();
 	CommitTsShmemInit();
 	SUBTRANSShmemInit();
 	MultiXactShmemInit();
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 81dac45..01b03c9 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -521,6 +521,8 @@ RegisterLWLockTranches(void)
 	LWLockRegisterTranche(LWTRANCHE_TBM, "tbm");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_APPEND, "parallel_append");
 	LWLockRegisterTranche(LWTRANCHE_PARALLEL_HASH_JOIN, "parallel_hash_join");
+	LWLockRegisterTranche(LWTRANCHE_UNDOLOG, "undo_log");
+	LWLockRegisterTranche(LWTRANCHE_UNDODISCARD, "undo_discard");
 
 	/* Register named tranches. */
 	for (i = 0; i < NamedLWLockTrancheRequests; i++)
diff --git a/src/backend/storage/lmgr/lwlocknames.txt b/src/backend/storage/lmgr/lwlocknames.txt
index db47843..bd07ed6 100644
--- a/src/backend/storage/lmgr/lwlocknames.txt
+++ b/src/backend/storage/lmgr/lwlocknames.txt
@@ -49,3 +49,4 @@ MultiXactTruncationLock				41
 OldSnapshotTimeMapLock				42
 LogicalRepWorkerLock				43
 CLogTruncationLock					44
+UndoLogLock					45
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c3373df..6013c38 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -556,6 +556,7 @@ BaseInit(void)
 	InitFileAccess();
 	smgrinit();
 	InitBufferPoolAccess();
+	UndoLogInit();
 }
 
 
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index f81e042..c5698b5 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -119,6 +119,7 @@ extern int	CommitDelay;
 extern int	CommitSiblings;
 extern char *default_tablespace;
 extern char *temp_tablespaces;
+extern char *undo_tablespaces;
 extern bool ignore_checksum_failure;
 extern bool synchronize_seqscans;
 
@@ -3534,6 +3535,17 @@ static struct config_string ConfigureNamesString[] =
 	},
 
 	{
+		{"undo_tablespaces", PGC_USERSET, CLIENT_CONN_STATEMENT,
+			gettext_noop("Sets the tablespace(s) to use for undo logs."),
+			NULL,
+			GUC_LIST_INPUT | GUC_LIST_QUOTE
+		},
+		&undo_tablespaces,
+		"",
+		check_undo_tablespaces, assign_undo_tablespaces, NULL
+	},
+
+	{
 		{"dynamic_library_path", PGC_SUSET, CLIENT_CONN_OTHER,
 			gettext_noop("Sets the path for dynamically loadable modules."),
 			gettext_noop("If a dynamically loadable module needs to be opened and "
diff --git a/src/bin/initdb/initdb.c b/src/bin/initdb/initdb.c
index 3ebe05d..771d479 100644
--- a/src/bin/initdb/initdb.c
+++ b/src/bin/initdb/initdb.c
@@ -208,11 +208,13 @@ static const char *const subdirs[] = {
 	"pg_snapshots",
 	"pg_subtrans",
 	"pg_twophase",
+	"pg_undo",
 	"pg_multixact",
 	"pg_multixact/members",
 	"pg_multixact/offsets",
 	"base",
 	"base/1",
+	"base/undo",
 	"pg_replslot",
 	"pg_tblspc",
 	"pg_stat",
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 852d8ca..938150d 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -20,6 +20,7 @@
 #include "access/nbtxlog.h"
 #include "access/rmgr.h"
 #include "access/spgxlog.h"
+#include "access/undolog_xlog.h"
 #include "access/xact.h"
 #include "access/xlog_internal.h"
 #include "catalog/storage_xlog.h"
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 3c0db2c..6945e3e 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL)
+PG_RMGR(RM_UNDOLOG_ID, "UndoLog", undolog_redo, undolog_desc, undolog_identify, NULL, NULL, NULL)
diff --git a/src/include/access/undolog.h b/src/include/access/undolog.h
new file mode 100644
index 0000000..8a7e1e4
--- /dev/null
+++ b/src/include/access/undolog.h
@@ -0,0 +1,398 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog.h
+ *
+ * PostgreSQL undo log manager.  This module is responsible for lifecycle
+ * management of undo logs and backing files, associating undo logs with
+ * backends, allocating and managing space within undo logs.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_H
+#define UNDOLOG_H
+
+#include "access/xlogreader.h"
+#include "catalog/pg_class.h"
+#include "common/relpath.h"
+#include "storage/bufpage.h"
+
+#ifndef FRONTEND
+#include "storage/lwlock.h"
+#endif
+
+/* The type used to identify an undo log and position within it. */
+typedef uint64 UndoRecPtr;
+
+/* The type used for undo record lengths. */
+typedef uint16 UndoRecordSize;
+
+/* Undo log statuses. */
+typedef enum
+{
+	UNDO_LOG_STATUS_UNUSED = 0,
+	UNDO_LOG_STATUS_ACTIVE,
+	UNDO_LOG_STATUS_FULL,
+	UNDO_LOG_STATUS_DISCARDED
+} UndoLogStatus;
+
+/*
+ * Undo log persistence levels.  These have a one-to-one correspondence with
+ * relpersistence values, but are small integers so that we can use them as an
+ * index into the "logs" and "lognos" arrays.
+ */
+typedef enum
+{
+	UNDO_PERMANENT = 0,
+	UNDO_UNLOGGED = 1,
+	UNDO_TEMP = 2
+} UndoPersistence;
+
+#define UndoPersistenceLevels 3
+
+/*
+ * Convert from relpersistence ('p', 'u', 't') to an UndoPersistence
+ * enumerator.
+ */
+#define UndoPersistenceForRelPersistence(rp)						\
+	((rp) == RELPERSISTENCE_PERMANENT ? UNDO_PERMANENT :			\
+	 (rp) == RELPERSISTENCE_UNLOGGED ? UNDO_UNLOGGED : UNDO_TEMP)
+
+/*
+ * Convert from UndoPersistence to a relpersistence value.
+ */
+#define RelPersistenceForUndoPersistence(up)				\
+	((up) == UNDO_PERMANENT ? RELPERSISTENCE_PERMANENT :	\
+	 (up) == UNDO_UNLOGGED ? RELPERSISTENCE_UNLOGGED :		\
+	 RELPERSISTENCE_TEMP)
+
+/*
+ * Get the appropriate UndoPersistence value from a Relation.
+ */
+#define UndoPersistenceForRelation(rel)									\
+	(UndoPersistenceForRelPersistence((rel)->rd_rel->relpersistence))
+
+/* Type for offsets within undo logs */
+typedef uint64 UndoLogOffset;
+
+/* printf-family format string for UndoRecPtr. */
+#define UndoRecPtrFormat "%016" INT64_MODIFIER "X"
+
+/* printf-family format string for UndoLogOffset. */
+#define UndoLogOffsetFormat UINT64_FORMAT
+
+/* Number of blocks of BLCKSZ in an undo log segment file.  128 = 1MB. */
+#define UNDOSEG_SIZE 128
+
+/* Size of an undo log segment file in bytes. */
+#define UndoLogSegmentSize ((size_t) BLCKSZ * UNDOSEG_SIZE)
+
+/* The width of an undo log number in bits.  24 allows for 16.7m logs. */
+#define UndoLogNumberBits 24
+
+/* The maximum valid undo log number. */
+#define MaxUndoLogNumber ((1 << UndoLogNumberBits) - 1)
+
+/* The width of an undo log offset in bits.  40 allows for 1TB per log.*/
+#define UndoLogOffsetBits (64 - UndoLogNumberBits)
+
+/* Special value for undo record pointer which indicates that it is invalid. */
+#define	InvalidUndoRecPtr	((UndoRecPtr) 0)
+
+/* End-of-list value when building linked lists of undo logs. */
+#define InvalidUndoLogNumber -1
+
+/*
+ * This special undo record pointer value is used in the transaction header
+ * to indicate that we don't yet know the start point of the next
+ * transaction; it will be updated with a valid value later.
+ */
+#define SpecialUndoRecPtr	((UndoRecPtr) 0xFFFFFFFFFFFFFFFF)
+
+/*
+ * The maximum amount of data that can be stored in an undo log.  Can be set
+ * artificially low to test full log behavior.
+ */
+#define UndoLogMaxSize ((UndoLogOffset) 1 << UndoLogOffsetBits)
+
+/* Type for numbering undo logs. */
+typedef int UndoLogNumber;
+
+/* Extract the undo log number from an UndoRecPtr. */
+#define UndoRecPtrGetLogNo(urp)					\
+	((urp) >> UndoLogOffsetBits)
+
+/* Extract the offset from an UndoRecPtr. */
+#define UndoRecPtrGetOffset(urp)				\
+	((urp) & ((UINT64CONST(1) << UndoLogOffsetBits) - 1))
+
+/* Make an UndoRecPtr from a log number and an offset. */
+#define MakeUndoRecPtr(logno, offset)			\
+	(((uint64) (logno) << UndoLogOffsetBits) | (offset))
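+
+/*
+ * Worked example of the pointer layout (illustrative only, using
+ * UndoLogNumberBits = 24 and UndoLogOffsetBits = 40 as defined above):
+ *
+ *   MakeUndoRecPtr(3, 0x20)                  == 0x0000030000000020
+ *   UndoRecPtrGetLogNo(0x0000030000000020)   == 3
+ *   UndoRecPtrGetOffset(0x0000030000000020)  == 0x20
+ *
+ * With 2^24 (~16.7 million) possible logs of up to 2^40 bytes (1TB) each,
+ * the total undo address space is 2^64 bytes (16 exabytes).
+ */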
+
+/* The number of unusable bytes in the header of each block. */
+#define UndoLogBlockHeaderSize SizeOfPageHeaderData
+
+/* The number of usable bytes we can store per block. */
+#define UndoLogUsableBytesPerPage (BLCKSZ - UndoLogBlockHeaderSize)
+
+/* The pseudo-database OID used for undo logs. */
+#define UndoLogDatabaseOid 9
+
+/* Length of undo checkpoint filename */
+#define UNDO_CHECKPOINT_FILENAME_LENGTH	16
+
+/*
+ * UndoRecPtrIsValid
+ *		True iff undoRecPtr is valid.
+ */
+#define UndoRecPtrIsValid(undoRecPtr) \
+	((bool) ((UndoRecPtr) (undoRecPtr) != InvalidUndoRecPtr))
+
+/* Extract the relnode for an undo log. */
+#define UndoRecPtrGetRelNode(urp)				\
+	UndoRecPtrGetLogNo(urp)
+
+/* The only valid fork number for undo log buffers. */
+#define UndoLogForkNum MAIN_FORKNUM
+
+/* Compute the block number that holds a given UndoRecPtr. */
+#define UndoRecPtrGetBlockNum(urp)				\
+	(UndoRecPtrGetOffset(urp) / BLCKSZ)
+
+/* Compute the offset of a given UndoRecPtr in the page that holds it. */
+#define UndoRecPtrGetPageOffset(urp)			\
+	(UndoRecPtrGetOffset(urp) % BLCKSZ)
+
+/* Compare two undo checkpoint files to find the oldest file. */
+#define UndoCheckPointFilenamePrecedes(file1, file2)	\
+	(strcmp(file1, file2) < 0)
+
+/* What is the offset of the i'th non-header byte? */
+#define UndoLogOffsetFromUsableByteNo(i)								\
+	(((i) / UndoLogUsableBytesPerPage) * BLCKSZ +						\
+	 UndoLogBlockHeaderSize +											\
+	 ((i) % UndoLogUsableBytesPerPage))
+
+/* How many non-header bytes are there before a given offset? */
+#define UndoLogOffsetToUsableByteNo(offset)				\
+	(((offset) % BLCKSZ - UndoLogBlockHeaderSize) +		\
+	 ((offset) / BLCKSZ) * UndoLogUsableBytesPerPage)
+
+/* Add 'n' usable bytes to offset stepping over headers to find new offset. */
+#define UndoLogOffsetPlusUsableBytes(offset, n)							\
+	UndoLogOffsetFromUsableByteNo(UndoLogOffsetToUsableByteNo(offset) + (n))
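+
+/*
+ * Worked example of the usable-byte arithmetic above (illustrative only;
+ * assuming the default 8192-byte BLCKSZ and a 24-byte page header, so that
+ * UndoLogUsableBytesPerPage == 8168):
+ *
+ *   UndoLogOffsetToUsableByteNo(8180)      == 8156
+ *   UndoLogOffsetPlusUsableBytes(8180, 20) == 8224
+ *
+ * That is, adding 20 usable bytes to offset 8180 steps over the header of
+ * the next block and lands 8 usable bytes into it (8192 + 24 + 8).
+ */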
+
+/* Populate a RelFileNode from an UndoRecPtr. */
+#define UndoRecPtrAssignRelFileNode(rfn, urp)			\
+	do													\
+	{													\
+		(rfn).spcNode = UndoRecPtrGetTablespace(urp);	\
+		(rfn).dbNode = UndoLogDatabaseOid;				\
+		(rfn).relNode = UndoRecPtrGetRelNode(urp);		\
+	} while (false);
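+
+/*
+ * Worked example of how an UndoRecPtr maps to a buffer tag (illustrative
+ * only; assuming the default 8192-byte BLCKSZ): for urp =
+ * MakeUndoRecPtr(3, 20480), UndoRecPtrAssignRelFileNode() fills in
+ * {spcNode = the log's tablespace, dbNode = UndoLogDatabaseOid (9),
+ * relNode = 3}, and the data lives in fork UndoLogForkNum at block
+ * UndoRecPtrGetBlockNum(urp) == 2, page offset
+ * UndoRecPtrGetPageOffset(urp) == 4096.
+ */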
+
+/*
+ * Control metadata for an active undo log.  Lives in shared memory inside an
+ * UndoLogControl object, but also written to disk during checkpoints.
+ */
+typedef struct UndoLogMetaData
+{
+	UndoLogNumber logno;
+	UndoLogStatus status;
+	Oid		tablespace;
+	UndoPersistence persistence;	/* permanent, unlogged, temp? */
+	UndoLogOffset insert;			/* next insertion point (head) */
+	UndoLogOffset end;				/* one past end of highest segment */
+	UndoLogOffset discard;			/* oldest data needed (tail) */
+	UndoLogOffset last_xact_start;	/* last transaction's start undo offset */
+
+	bool	is_first_rec;
+
+	/*
+	 * Length of the most recently inserted undo record.  We save this in the
+	 * undo meta-data and WAL-log it so that the value survives a restart and
+	 * the first undo record inserted afterwards can still find it.  It is
+	 * used to step back to the previous record of a transaction during
+	 * rollback: if a transaction wrote some undo before a checkpoint and
+	 * more after it, we could not roll back properly without the prevlen of
+	 * the record written before the checkpoint.  The undo worker also
+	 * fetches this value when rolling back the last transaction in the undo
+	 * log, to locate that transaction's final undo record.
+	 */
+	uint16	prevlen;
+} UndoLogMetaData;
+
+#ifndef FRONTEND
+
+/*
+ * The in-memory control object for an undo log.  We have a fixed-sized array
+ * of these.
+ */
+typedef struct UndoLogControl
+{
+	/*
+	 * Protected by UndoLogLock and 'mutex'.  Both must be held to steal this
+	 * slot for another undolog.  Either may be held to prevent that from
+	 * happening.
+	 */
+	UndoLogNumber logno;			/* InvalidUndoLogNumber for unused slots */
+
+	/* Protected by UndoLogLock. */
+	UndoLogNumber next_free;		/* link for active unattached undo logs */
+
+	/* Protected by 'mutex'. */
+	LWLock	mutex;
+	UndoLogMetaData meta;			/* current meta-data */
+	XLogRecPtr      lsn;
+	bool	need_attach_wal_record;	/* need to WAL-log the next attachment? */
+	pid_t		pid;				/* InvalidPid for unattached */
+	TransactionId xid;
+
+	/* Protected by 'discard_lock'.  State used by undo workers. */
+	LWLock		discard_lock;		/* prevents discarding while reading */
+	TransactionId	oldest_xid;		/* cache of oldest transaction's xid */
+	uint32		oldest_xidepoch;
+	UndoRecPtr	oldest_data;
+
+} UndoLogControl;
+
+extern UndoLogControl *UndoLogGet(UndoLogNumber logno, bool missing_ok);
+extern UndoLogControl *UndoLogNext(UndoLogControl *log);
+extern bool AmAttachedToUndoLog(UndoLogControl *log);
+extern UndoRecPtr UndoLogGetFirstValidRecord(UndoLogControl *log, bool *full);
+
+/*
+ * Each backend maintains a small hash table mapping undo log numbers to
+ * UndoLogControl objects in shared memory.
+ *
+ * We also cache the tablespace here, since we need fast access to that when
+ * resolving an UndoRecPtr to a buffer tag.  We could also reach that via
+ * control->meta.tablespace, but that can't be accessed without locking (since
+ * the UndoLogControl object might be recycled).  Since the tablespace for a
+ * given undo log is constant for the whole life of the undo log, there is no
+ * invalidation problem to worry about.
+ */
+typedef struct UndoLogTableEntry
+{
+	UndoLogNumber	number;
+	UndoLogControl *control;
+	Oid				tablespace;
+	char			status;
+} UndoLogTableEntry;
+
+/*
+ * Instantiate fast inline hash table access functions.  We use an identity
+ * hash function for speed, since we already have integers and don't expect
+ * many collisions.
+ */
+#define SH_PREFIX undologtable
+#define SH_ELEMENT_TYPE UndoLogTableEntry
+#define SH_KEY_TYPE UndoLogNumber
+#define SH_KEY number
+#define SH_HASH_KEY(tb, key) (key)
+#define SH_EQUAL(tb, a, b) ((a) == (b))
+#define SH_SCOPE static inline
+#define SH_DECLARE
+#define SH_DEFINE
+#include "lib/simplehash.h"
+
+extern PGDLLIMPORT undologtable_hash *undologtable_cache;
+
+/*
+ * Find the OID of the tablespace that holds a given UndoRecPtr.  This is
+ * included in the header so it can be inlined by UndoRecPtrAssignRelFileNode.
+ */
+static inline Oid
+UndoRecPtrGetTablespace(UndoRecPtr urp)
+{
+	UndoLogNumber		logno = UndoRecPtrGetLogNo(urp);
+	UndoLogTableEntry  *entry;
+
+	/*
+	 * Fast path, for undo logs we've seen before.  This is safe because
+	 * tablespaces are constant for the lifetime of an undo log number.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	if (likely(entry))
+		return entry->tablespace;
+
+	/*
+	 * Slow path: force cache entry to be created.  Raises an error if the
+	 * undo log has been entirely discarded, or hasn't been created yet.  That
+	 * is appropriate here, because this interface is designed for accessing
+	 * undo pages via bufmgr, and we should never be trying to access undo
+	 * pages that have been discarded.
+	 */
+	UndoLogGet(logno, false);
+
+	/*
+	 * We use the value from the newly created cache entry, because it's
+	 * cheaper than acquiring log->mutex and reading log->meta.tablespace.
+	 */
+	entry = undologtable_lookup(undologtable_cache, logno);
+	return entry->tablespace;
+}
+#endif
+
+/* Space management. */
+extern UndoRecPtr UndoLogAllocate(size_t size,
+								  UndoPersistence level);
+extern UndoRecPtr UndoLogAllocateInRecovery(TransactionId xid,
+											size_t size,
+											UndoPersistence persistence);
+extern void UndoLogAdvance(UndoRecPtr insertion_point,
+						   size_t size,
+						   UndoPersistence persistence);
+extern void UndoLogDiscard(UndoRecPtr discard_point, TransactionId xid);
+extern bool UndoLogIsDiscarded(UndoRecPtr point);
+
+/* Initialization interfaces. */
+extern void StartupUndoLogs(XLogRecPtr checkPointRedo);
+extern void UndoLogShmemInit(void);
+extern Size UndoLogShmemSize(void);
+extern void UndoLogInit(void);
+extern void UndoLogSegmentPath(UndoLogNumber logno, int segno, Oid tablespace,
+							   char *path);
+extern void ResetUndoLogs(UndoPersistence persistence);
+
+/* Interface used by tablespace.c. */
+extern bool DropUndoLogsInTablespace(Oid tablespace);
+
+/* GUC interfaces. */
+extern void assign_undo_tablespaces(const char *newval, void *extra);
+
+/* Checkpointing interfaces. */
+extern void CheckPointUndoLogs(XLogRecPtr checkPointRedo,
+							   XLogRecPtr priorCheckPointRedo);
+
+extern void UndoLogSetLastXactStartPoint(UndoRecPtr point);
+extern UndoRecPtr UndoLogGetLastXactStartPoint(UndoLogNumber logno);
+extern UndoRecPtr UndoLogGetNextInsertPtr(UndoLogNumber logno,
+										  TransactionId xid);
+extern UndoRecPtr UndoLogGetLastRecordPtr(UndoLogNumber,
+										  TransactionId xid);
+extern void UndoLogRewind(UndoRecPtr insert_urp, uint16 prevlen);
+extern bool IsTransactionFirstRec(TransactionId xid);
+extern void UndoLogSetPrevLen(UndoLogNumber logno, uint16 prevlen);
+extern uint16 UndoLogGetPrevLen(UndoLogNumber logno);
+extern void UndoLogSetLSN(XLogRecPtr lsn);
+extern void UndoLogNewSegment(UndoLogNumber logno, Oid tablespace, int segno);
+
+/* Redo interface. */
+extern void undolog_redo(XLogReaderState *record);
+
+/* Discard the undo logs for temp tables. */
+extern void TempUndoDiscard(UndoLogNumber);
+extern UndoRecPtr UndoLogStateGetAndClearPrevLogXactUrp(void);
+extern UndoLogNumber UndoLogAmAttachedTo(UndoPersistence persistence);
+extern Oid UndoLogStateGetDatabaseId(void);
+
+/* Test-only interfaces. */
+extern void UndoLogDetachFull(void);
+
+#endif
diff --git a/src/include/access/undolog_xlog.h b/src/include/access/undolog_xlog.h
new file mode 100644
index 0000000..34a622e
--- /dev/null
+++ b/src/include/access/undolog_xlog.h
@@ -0,0 +1,73 @@
+/*-------------------------------------------------------------------------
+ *
+ * undolog_xlog.h
+ *	  undo log access XLOG definitions.
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undolog_xlog.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOLOG_XLOG_H
+#define UNDOLOG_XLOG_H
+
+#include "access/undolog.h"
+#include "access/xlogreader.h"
+#include "lib/stringinfo.h"
+
+/* XLOG records */
+#define XLOG_UNDOLOG_CREATE		0x00
+#define XLOG_UNDOLOG_EXTEND		0x10
+#define XLOG_UNDOLOG_ATTACH		0x20
+#define XLOG_UNDOLOG_DISCARD	0x30
+#define XLOG_UNDOLOG_REWIND		0x40
+#define XLOG_UNDOLOG_META		0x50
+#define XLOG_UNDOLOG_SWITCH		0x60
+
+/* Create a new undo log. */
+typedef struct xl_undolog_create
+{
+	UndoLogNumber logno;
+	Oid		tablespace;
+	UndoPersistence persistence;
+} xl_undolog_create;
+
+/* Extend an undo log by adding a new segment. */
+typedef struct xl_undolog_extend
+{
+	UndoLogNumber logno;
+	UndoLogOffset end;
+} xl_undolog_extend;
+
+/* Record the undo log number used for a transaction. */
+typedef struct xl_undolog_attach
+{
+	TransactionId xid;
+	UndoLogNumber logno;
+	Oid				dbid;
+} xl_undolog_attach;
+
+/* Discard space, and possibly destroy or recycle undo log segments. */
+typedef struct xl_undolog_discard
+{
+	UndoLogNumber logno;
+	UndoLogOffset discard;
+	UndoLogOffset end;
+	TransactionId latestxid;	/* latest xid whose undolog are discarded. */
+	bool		  entirely_discarded;
+} xl_undolog_discard;
+
+/* Rewind insert location of the undo log. */
+typedef struct xl_undolog_rewind
+{
+	UndoLogNumber logno;
+	UndoLogOffset insert;
+	uint16		  prevlen;
+} xl_undolog_rewind;
+
+extern void undolog_desc(StringInfo buf, XLogReaderState *record);
+extern const char *undolog_identify(uint8 info);
+
+#endif
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 3ecc2e1..19dab2f 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -10509,4 +10509,11 @@
   proargnames => '{rootrelid,relid,parentrelid,isleaf,level}',
   prosrc => 'pg_partition_tree' },
 
+# undo logs
+{ oid => '5032', descr => 'list undo logs',
+  proname => 'pg_stat_get_undo_logs', procost => '1', prorows => '10', proretset => 't',
+  prorettype => 'record', proargtypes => '',
+  proallargtypes => '{oid,text,text,text,text,text,xid,int4,oid,text}', proargmodes => '{o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{logno,persistence,tablespace,discard,insert,end,xid,pid,prev_logno,status}', prosrc => 'pg_stat_get_undo_logs' },
+
 ]
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 96c7732..a86c993 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -219,6 +219,8 @@ typedef enum BuiltinTrancheIds
 	LWTRANCHE_SHARED_TUPLESTORE,
 	LWTRANCHE_TBM,
 	LWTRANCHE_PARALLEL_APPEND,
+	LWTRANCHE_UNDOLOG,
+	LWTRANCHE_UNDODISCARD,
 	LWTRANCHE_FIRST_USER_DEFINED
 }			BuiltinTrancheIds;
 
diff --git a/src/include/utils/guc.h b/src/include/utils/guc.h
index c07e7b9..7ed5445 100644
--- a/src/include/utils/guc.h
+++ b/src/include/utils/guc.h
@@ -426,6 +426,8 @@ extern void GUC_check_errcode(int sqlerrcode);
 extern bool check_default_tablespace(char **newval, void **extra, GucSource source);
 extern bool check_temp_tablespaces(char **newval, void **extra, GucSource source);
 extern void assign_temp_tablespaces(const char *newval, void *extra);
+extern bool check_undo_tablespaces(char **newval, void **extra, GucSource source);
+extern void assign_undo_tablespaces(const char *newval, void *extra);
 
 /* in catalog/namespace.c */
 extern bool check_search_path(char **newval, void **extra, GucSource source);
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index e384cd2..a981a4b 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -1919,6 +1919,17 @@ pg_stat_sys_tables| SELECT pg_stat_all_tables.relid,
     pg_stat_all_tables.autoanalyze_count
    FROM pg_stat_all_tables
   WHERE ((pg_stat_all_tables.schemaname = ANY (ARRAY['pg_catalog'::name, 'information_schema'::name])) OR (pg_stat_all_tables.schemaname ~ '^pg_toast'::text));
+pg_stat_undo_logs| SELECT pg_stat_get_undo_logs.logno,
+    pg_stat_get_undo_logs.persistence,
+    pg_stat_get_undo_logs.tablespace,
+    pg_stat_get_undo_logs.discard,
+    pg_stat_get_undo_logs.insert,
+    pg_stat_get_undo_logs."end",
+    pg_stat_get_undo_logs.xid,
+    pg_stat_get_undo_logs.pid,
+    pg_stat_get_undo_logs.prev_logno,
+    pg_stat_get_undo_logs.status
+   FROM pg_stat_get_undo_logs() pg_stat_get_undo_logs(logno, persistence, tablespace, discard, insert, "end", xid, pid, prev_logno, status);
 pg_stat_user_functions| SELECT p.oid AS funcid,
     n.nspname AS schemaname,
     p.proname AS funcname,
-- 
1.8.3.1

0003-undo-interface-v13.patchapplication/octet-stream; name=0003-undo-interface-v13.patchDownload
From 169bd3f5f5490965b99fac33bcb1e69cbe637688 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 5 Jan 2019 10:43:18 +0530
Subject: [PATCH 3/3] undo-interface-v13

Provide an interface to prepare, insert, and fetch undo records.  This
layer uses the undo-log-storage layer to reserve space for the undo
records and the buffer management routines to write and read them.
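
A rough usage sketch (illustrative only: the function and type names below
are hypothetical placeholders for the interface declared in undoinsert.h
and undorecord.h, and may not match the actual declarations):

    UnpackedUndoRecord undorecord;   /* hypothetical record descriptor */
    UndoRecPtr urecptr;

    /* Reserve undo log space and pin the buffers that will hold the record. */
    urecptr = PrepareUndoInsert(&undorecord, UNDO_PERMANENT, xid);

    /* ... perform the main data change inside a critical section ... */

    /* Copy the prepared record into the reserved undo space. */
    InsertPreparedUndo();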

Dilip Kumar with help from Rafia Sabih based on an early prototype
for forming undo records by Robert Haas and design inputs from Amit Kapila
Reviewed by Amit Kapila.
---
 src/backend/access/transam/xact.c    |   28 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1243 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  464 +++++++++++++
 src/include/access/undoinsert.h      |   50 ++
 src/include/access/undorecord.h      |  197 ++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2016 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f665e38..806320b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2657,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4842,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c4c5ab4..390ccba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8329,6 +8329,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
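+	 *
+	 * Illustrative worked example (numbers are hypothetical): if ckptXid is
+	 * 100 in epoch 5, then an xid of 3 that logically follows ckptXid must
+	 * have wrapped around, so it is reported as epoch 6, while an xid of
+	 * 4000000000 that logically precedes ckptXid belongs to epoch 4.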
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..b6c0491
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1243 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's undo in one shot rather
+ * than record by record.  Callers are not aware of the transaction header;
+ * it is entirely maintained and used by the undo record layer.  See
+ * undorecord.h for detailed information about the undo record header.
+ *
+ * Multiple logs:
+ *
+ * It is possible that the undo records for a transaction span across
+ * multiple undo logs.  We need some special handling while inserting them to
+ * ensure that discard and rollback can work sanely.
+ *
+ * When an undo record for a transaction gets inserted into the next log, we
+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new log's location.  We also keep
+ * a back pointer to the last undo record of the previous log in the first
+ * record of the new log, so that we can traverse to the previous record
+ * during rollback.  In case this is not the first record in the new log
+ * (i.e. the new log already contains some other transaction's data), we also
+ * update that transaction's next start header with this new undo record's
+ * location.  This allows us to connect a transaction's undo records across
+ * logs when the same transaction spans multiple logs.
+ *
+ * Rollbacks work somewhat differently when the undo for the same transaction
+ * spans multiple logs, depending on which log is processed first by the
+ * discard worker.  If it processes the first log, which contains the
+ * transaction's first record, then it can find the last record of that
+ * transaction even if it is in a different log, and then process all
+ * the undo records from last to first.  OTOH, if the next log gets processed
+ * first, we don't need to trace back the actual start pointer of the
+ * transaction; instead we only execute the undo actions from the current log
+ * and avoid re-executing them next time.  It is possible that after
+ * executing the undo actions, the undo gets discarded; later, while
+ * processing the previous log, we might try to fetch an undo record in the
+ * discarded log while chasing the transaction header chain, which can cause
+ * trouble.  We avoid this situation by first checking whether the next_urec
+ * of the transaction is already discarded and, if so, starting execution from
+ * the last undo record in the current log.
+ *
+ *-------------------------------------------------------------------------
+ */
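+
+/*
+ * Typical caller flow, as a minimal illustrative sketch only: the real
+ * callers live in later zheap patches, and the record fields shown here are
+ * hypothetical placeholders.
+ *
+ *     UnpackedUndoRecord undorecord;
+ *     UndoRecPtr urecptr;
+ *
+ *     memset(&undorecord, 0, sizeof(undorecord));
+ *     ... fill in uur_type, uur_reloid, uur_block, payload, etc ...
+ *     urecptr = PrepareUndoInsert(&undorecord, InvalidTransactionId,
+ *                                 UNDO_PERMANENT);
+ *     START_CRIT_SECTION();
+ *     InsertPreparedUndo();
+ *     ... register WAL data and undo buffers (RegisterUndoLogBuffers),
+ *     ... call XLogInsert() and set page LSNs (UndoLogBuffersSetLSN) ...
+ *     END_CRIT_SECTION();
+ *     UnlockReleaseUndoBuffers();
+ */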
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "access/undolog_xlog.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size larger than BLCKSZ?  If not,
+ * then an undo record can spread across at most 2 buffers.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * This defines the max number of previous xact infos we need to update.
+ * Usually it's 1, for updating the next link of the previous transaction's
+ * header when we are starting a new transaction.  But in some cases, where
+ * the same transaction spills into the next log, we update our own
+ * transaction's header in the previous undo log as well as the header of
+ * the previous transaction in the new log.
+ */
+#define MAX_XACT_UNDO_INFO	2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record as well.
+ */
+#define MAX_UNDO_BUFFERS       ((MAX_PREPARED_UNDO + MAX_XACT_UNDO_INFO) * MAX_BUFFER_PER_UNDO)
+
+/*
+ * Previous top transaction id which inserted undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we add a transaction header to its first
+ * undo record.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.
+ * If the caller wants to prepare more than the default maximum number of undo
+ * records, the limit can be increased by calling UndoSetPrepareSize.
+ * In that case, dynamic memory is allocated and prepared_undo and undo_buffer
+ * start pointing to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, and these variables are then set back to their
+ * default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating its undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info[MAX_XACT_UNDO_INFO];
+static int	xact_urec_info_idx;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   UndoRecPtr xact_urp);
+static void UndoRecordUpdateTransInfo(int idx);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record has been discarded.  If it's already discarded
+ * return false, otherwise return true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock if
+ * it returns false; otherwise the lock is still held on return and the caller
+ * needs to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker attempts to
+		 * discard undo logs for the first time, so we cannot rely on this
+		 * value to identify whether the undo record pointer is already
+		 * discarded; instead we check by calling the undo log routine.  If
+		 * it's not yet discarded then we have to reacquire log->discard_lock
+		 * so that it doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, UndoRecPtr xact_urp)
+{
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case we have nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(xact_urp))
+		return;
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(xact_urp), false);
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * If the previous transaction's undo record has already been discarded,
+	 * we have nothing to update.  UndoRecordIsValid will release the lock if
+	 * it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the undo
+	 * record header is split across buffers then we need to read the complete
+	 * header by invoking UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info[xact_urec_info_idx].idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info[xact_urec_info_idx].uur, page,
+							 starting_byte, &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info[xact_urec_info_idx].uur.uur_next = urecptr;
+	xact_urec_info[xact_urec_info_idx].urecptr = xact_urp;
+	xact_urec_info_idx++;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info[idx].urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			i = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info[idx].urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker can't remove the record while we are in the process
+	 * of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info[idx].idx_undo_buffers[i];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info[idx].uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		i++;
+
+		Assert(i < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present then just
+ * return its index, otherwise read the buffer, insert an entry into the array
+ * and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare (logno, blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate the space in
+ * bulk.  This is required for operations that allocate multiple undo records
+ * under one WAL record, e.g. multi-insert.  If we don't allocate undo space
+ * for all the records (which are inserted under one WAL record) together,
+ * there is a possibility that they end up in different undo logs.  And,
+ * currently during recovery we don't have a mechanism to map an xid to
+ * multiple log numbers within one WAL operation.  So, in short, all the
+ * operations under one WAL record must allocate their undo from the same
+ * undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	UndoRecPtr	prevlogurp = InvalidUndoRecPtr;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 *
+		 * XXX There is also the option that, instead of adding the
+		 * information to this record, we could prepare a new record which
+		 * only contains transaction information, but we don't see any clear
+		 * advantage in that.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			if (log_switched)
+			{
+				/*
+				 * If the undo log is switched then during rollback we cannot
+				 * reach the previous undo record of the transaction via
+				 * prevlen, so we store the previous undo record pointer in
+				 * the transaction header.
+				 */
+				Assert(UndoRecPtrIsValid(prevlogno));
+				log = UndoLogGet(prevlogno, false);
+				urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+												   log->meta.insert - log->meta.prevlen);
+			}
+			else
+				urec->uur_prevurp = InvalidUndoRecPtr;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values,
+			 * as they are used only with the first record of a transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+			urec->uur_prevurp = InvalidUndoRecPtr;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	/*
+	 * Check whether the undo log got switched while we are in a transaction.
+	 */
+	if (InRecovery)
+	{
+		/*
+		 * During recovery we can identify the log switch by checking the
+		 * prevlogurp from the MyUndoLogState.  The WAL replay action for log
+		 * switch would have set the value and we need to clear it after
+		 * retrieving the latest value.
+		 */
+		prevlogurp = UndoLogStateGetAndClearPrevLogXactUrp();
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+		if (UndoRecPtrIsValid(prevlogurp))
+		{
+			prevlogno = UndoRecPtrGetLogNo(prevlogurp);
+			log_switched = true;
+		}
+	}
+	else
+	{
+		/*
+		 * Check whether the current log is switched after allocation.  We can
+		 * determine that by simply checking to which log we are attached
+		 * before and after allocation.
+		 */
+		prevlogno = UndoLogAmAttachedTo(upersistence);
+		urecptr = UndoLogAllocate(size, upersistence);
+		if (!need_xact_hdr &&
+			prevlogno != InvalidUndoLogNumber &&
+			prevlogno != UndoRecPtrGetLogNo(urecptr))
+		{
+			log = UndoLogGet(prevlogno, false);
+			prevlogurp = MakeUndoRecPtr(prevlogno, log->meta.last_xact_start);
+			log_switched = true;
+		}
+	}
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space) or the undo log got switched, we'll
+	 * need a new transaction header. If we weren't already generating one,
+	 * then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start || log_switched))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/*
+		 * If the undo log is switched then we need to update our own
+		 * transaction header in the previous log as well as the previous
+		 * transaction's header in the new log.  See the detailed comments on
+		 * multi-log handling at the top of this file.
+		 */
+		if (log_switched)
+			UndoRecordPrepareTransInfo(urecptr, prevlogurp);
+
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr,
+									   MakeUndoRecPtr(log->logno, log->meta.last_xact_start));
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	/*
+	 * Write WAL for log switch.  This is required to identify the log switch
+	 * during recovery.
+	 */
+	if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+	}
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set how many undo records can be prepared before
+ * inserting them.  If the count is greater than MAX_PREPARED_UNDO then extra
+ * memory is allocated to hold the additional prepared undo records.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider the buffers needed for updating the previous transaction's
+	 * starting undo record as well; hence the count is increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
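+
+/*
+ * Illustrative sketch only (the array and loop below are hypothetical caller
+ * code, not part of this patch): to prepare several records under a single
+ * WAL record, size the prepared-undo machinery first and then prepare each
+ * record.
+ *
+ *     UndoSetPrepareSize(undorecords, nrecords, InvalidTransactionId,
+ *                        UNDO_PERMANENT);
+ *     for (i = 0; i < nrecords; i++)
+ *         PrepareUndoInsert(&undorecords[i], InvalidTransactionId,
+ *                           UNDO_PERMANENT);
+ *
+ * followed by InsertPreparedUndo() inside the critical section, as described
+ * atop PrepareUndoInsert and InsertPreparedUndo below.
+ */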
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL; otherwise,
+ * it refers to the top transaction id, because the undo log only stores the
+ * mapping for top-level transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id, because the undo log only stores
+		 * the mapping for top-level transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* Advance the prepared pointer location for the next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages they'll all be new, so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked by PrepareUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+			uur->uur_prevlen = 0;
+
+		/*
+		 * If starting from a new page then include the block header size in
+		 * the prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we write the first record into a
+			 * page.  We start writing immediately after the block header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * succeed then call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block then include the block
+			 * header in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+
+	/* Update previously prepared transaction headers. */
+	if (xact_urec_info_idx > 0)
+	{
+		int			i = 0;
+
+		for (i = 0; i < xact_urec_info_idx; i++)
+			UndoRecordUpdateTransInfo(i);
+	}
+
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed to
+ * by urp and unpacks the record into urec.  This function does not release
+ * the pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record that is on
+ * the same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from another
+ * block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This can be optimized to just fetch header first and only if
+		 * matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer, otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned then just check whether the next
+	 * record will be fetched from the same block.  Otherwise release the
+	 * buffer and set it invalid.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is no valid buffer in urec->uur_buffer, that means we
+		 * copied the payload data and tuple data, so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * If a valid pointer is passed, urec_ptr_out is set to the undo record
+ * pointer of the qualifying undo record.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's condition.
+ *
+ * Returns the required undo record if found; otherwise returns NULL, which
+ * means either the record has already been discarded or there is no such
+ * record in the undo chain.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
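+
+/*
+ * Illustrative sketch only: a hypothetical caller-supplied callback and call
+ * site (real callers appear in later zheap patches).
+ *
+ *     static bool
+ *     rec_matches_block_offset(UnpackedUndoRecord *urec, BlockNumber blkno,
+ *                              OffsetNumber offset, TransactionId xid)
+ *     {
+ *         return urec->uur_block == blkno && urec->uur_offset == offset;
+ *     }
+ *
+ *     uur = UndoFetchRecord(urp, blkno, offset, InvalidTransactionId,
+ *                           &urec_ptr, rec_matches_block_offset);
+ *     if (uur != NULL)
+ *     {
+ *         ... inspect uur ...
+ *         UndoRecordRelease(uur);
+ *     }
+ */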
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * A valid value of prevurp indicates that the previous undo record
+ * pointer is in some other log and caller can directly use that.
+ * Otherwise this will calculate the previous undo record pointer
+ * by using current urp and the prevlen.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp)
+{
+	if (UndoRecPtrIsValid(prevurp))
+		return prevurp;
+	else
+	{
+		UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+		UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+		/* calculate the previous undo record pointer */
+		return MakeUndoRecPtr(logno, offset - prevlen);
+	}
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set LSN on undo page.
+*/
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required at
+ * transaction abort and while releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	for (i = 0; i < xact_urec_info_idx; i++)
+		xact_urec_info[i].urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	xact_urec_info_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepared_undo limit was changed, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..a13abe3
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,464 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_prevurp = uur->uur_prevurp;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_prevurp == uur->uur_prevurp);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
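+
+/*
+ * A minimal sketch of the intended calling pattern (the page-fetching steps
+ * are elided and hypothetical; see InsertPreparedUndo and
+ * UndoRecordUpdateTransInfo in undoinsert.c for the real callers):
+ *
+ *     int already_written = 0;
+ *     int starting_byte = UndoRecPtrGetPageOffset(urp);
+ *
+ *     for (;;)
+ *     {
+ *         Page page = ... pin, lock and get the current block's page ...;
+ *
+ *         if (InsertUndoRecord(uur, page, starting_byte, &already_written,
+ *                              false))
+ *             break;
+ *
+ *         ... record continues: advance to the next block ...
+ *         starting_byte = UndoLogBlockHeaderSize;
+ *     }
+ */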
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_prevurp = work_txn.urec_prevurp;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page, otherwise
+		 * allocate memory.
+		 *
+		 * XXX There is a possible optimization: instead of always allocating
+		 * memory whenever the record is split, we could check whether the
+		 * payload or tuple data happens to fall entirely within one page
+		 * and, if so, avoid allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number of
+ * bytes to be read.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is set true then readlen bytes are skipped in the undo
+ * but are not copied into the destination buffer.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..41384e1
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+extern void UnlockReleaseUndoBuffers(void);
+
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp);
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..ebc638a
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,197 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  The structures are packed together without any alignment
+ * padding, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than oldestXidWithEpochHavingUndo, then we can
+	 * consider the tuple in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	UndoRecPtr	urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	/*
+	 * This indicates undo action apply progress, 0 means not started, 1 means
+	 * completed.  In future, it can also be used to show the progress of how
+	 * much undo has been applied so far with some formula.
+	 */
+	uint32		urec_progress;
+	uint32		urec_xidepoch;	/* epoch of the current transaction */
+	Oid			urec_dbid;		/* database id */
+
+	/*
+	 * Transaction's previous undo record pointer when a transaction spans
+	 * across undo logs.  The first undo record in the new log stores the
+	 * previous undo record pointer in the previous log as we can't calculate
+	 * that directly using prevlen during rollback.
+	 */
+	UndoRecPtr	urec_prevurp;
+	UndoRecPtr	urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction specific header information.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	UndoRecPtr	uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	UndoRecPtr	uur_prevurp;	/* urec pointer to the previous record in
+								 * the different log */
+	UndoRecPtr	uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/* undo applying progress, see detail comment in UndoRecordTransaction */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 169cf28..ddaa633 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a9..d18a1cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#32Amit Kapila
amit.kapila16@gmail.com
In reply to: Dilip Kumar (#31)
1 attachment(s)
Re: Undo logs

On Sat, Jan 5, 2019 at 11:29 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 1, 2019 at 4:37 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Thanks, the new changes look mostly okay to me, but I have few comments:
1.
+ /*
+ * WAL log, for log switch.  This is required to identify the log switch
+ * during recovery.
+ */
+ if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+ {
+ XLogBeginInsert();
+ XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+ XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+ }
+

Don't we want to do this under critical section?

I think we are not making any buffer changes here and are just inserting a
WAL record, so IMHO we don't need any critical section. Am I missing
something?

No, you are correct.

Few more comments:
--------------------------------
1.
+ * undorecord.c
+ *   encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group

Change the year in Copyright notice for all new files?

2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)

Is the above part of comment still correct? I don't see uur_info being set here.

3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;

It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.

4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;

I don't understand the above comment where it says "We must update it
for the bytes we write."  We always set 'my_bytes_written' to 0 if we
write anything. Can you clarify? I guess this part of the comment is
about total_bytes_written, or does it mean that the caller should
update it? I think some wording change might be required based on
what we intend to say here.

Similar to the above, there is some confusion in the description of
my_bytes_read atop ReadUndoBytes.
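
For reference, here is a minimal standalone sketch (simplified names,
not the patch code) of how I read the intended bookkeeping: the
my_bytes_* counter only carries the number of bytes of the current
structure that were already consumed by a previous call on an earlier
page, and it drops to zero as soon as we actually start copying, while
the total_bytes_* counter keeps accumulating across calls:

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define Min(a, b) ((a) < (b) ? (a) : (b))

/* Same shape as ReadUndoBytes, but with a persistent destination buffer. */
static bool
read_bytes(char *dest, int len, char **readptr, char *endptr,
           int *my_bytes_read, int *total_bytes_read)
{
    int         remaining;
    int         can_read;

    /* This structure was fully consumed by an earlier call; skip it. */
    if (*my_bytes_read >= len)
    {
        *my_bytes_read -= len;
        return true;
    }

    remaining = len - *my_bytes_read;
    can_read = Min(remaining, (int) (endptr - *readptr));
    if (can_read == 0)
        return false;

    memcpy(dest + *my_bytes_read, *readptr, can_read);
    *readptr += can_read;
    *total_bytes_read += can_read;
    *my_bytes_read = 0;         /* nothing of this struct is left "pre-read" */

    return can_read == remaining;
}

int
main(void)
{
    char        record[10] = "ABCDEFGHIJ";     /* a 10-byte "structure" */
    char        dest[11] = {0};
    char       *page1 = record;
    char       *page2 = record + 6;
    int         total = 0;
    int         mine;
    char       *readptr;

    /* The first page holds only 6 of the 10 bytes, so this returns false. */
    readptr = page1;
    mine = 0;
    read_bytes(dest, 10, &readptr, page1 + 6, &mine, &total);

    /*
     * The caller retries on the next page, seeding my_bytes_read from the
     * running total (6), so the copy resumes at dest + 6.
     */
    readptr = page2;
    mine = total;
    read_bytes(dest, 10, &readptr, page2 + 4, &mine, &total);

    printf("%s\n", dest);       /* prints ABCDEFGHIJ */
    return 0;
}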

5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around.  Xid is certainly
+ * logically later than ckptXid.
..

From the usage of this function in the patch, can we say that Xid is
always logically later than ckptXid? If so, how? Also, I think you
previously said in this thread that uur_xidepoch is mainly used for
zheap, so we might want to postpone including it in undo records. On
thinking again, I think we should follow your advice, as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1]. Am I correct? If so, I think we should postpone it for
now.
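
Just to spell out the wraparound case the existing comment is talking
about, here is a tiny self-contained sketch (illustrative only; it
ignores the special XIDs that the real TransactionIdPrecedes/Follows
handle) showing why the epoch has to be adjusted in both directions:

#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

typedef uint32_t TransactionId;

/* Modulo-2^32 ordering of normal xids. */
static bool
xid_precedes(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) < 0;
}

static bool
xid_follows(TransactionId a, TransactionId b)
{
    return (int32_t) (a - b) > 0;
}

/* Same adjustment as GetEpochForXid in the patch. */
static uint32_t
epoch_for_xid(TransactionId xid, TransactionId ckptXid, uint32_t epoch)
{
    if (xid > ckptXid && xid_precedes(xid, ckptXid))
        epoch--;                /* xid is from before the wraparound */
    else if (xid < ckptXid && xid_follows(xid, ckptXid))
        epoch++;                /* xid wrapped past a large ckptXid */
    return epoch;
}

int
main(void)
{
    /* Checkpoint taken just after a wraparound: ckptXid is small. */
    printf("%u\n", epoch_for_xid(4294967000u, 100, 5));        /* prints 4 */

    /* Newer xid that wrapped around while ckptXid is still large. */
    printf("%u\n", epoch_for_xid(50, 4294967000u, 5));         /* prints 6 */
    return 0;
}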

6.
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}

I think you can add some comments atop this function to explain its
purpose and how callers will use it.
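
Something along these lines would do (wording is just a suggestion,
based on how InsertPreparedUndo uses it):

/*
 * SetCurrentUndoLocation
 *
 * Remember the location of an undo record we have just inserted, so that
 * we know which undo to apply if the (sub)transaction aborts.  We track
 * the first (start_urec_ptr) and the latest (latest_urec_ptr) undo record
 * pointer for the current transaction, separately for each persistence
 * level.  InsertPreparedUndo calls this after writing each record.
 */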

I am done with the first level of code review for this patch. I am
sure we might need a few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me. I have modified the proposed commit
message in the attached patch; see if that looks fine to you.

To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.

[1]: /messages/by-id/CAEepm=2YYAtkSnens=TR2S=oRcAF9=2P7GPMK0wMJtxKF1QRig@mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-Provide-interfaces-to-store-and-fetch-undo-records-v14.patchapplication/octet-stream; name=0003-Provide-interfaces-to-store-and-fetch-undo-records-v14.patchDownload
From 61e115afb715b220ae7311f5b777bc0bd74537d6 Mon Sep 17 00:00:00 2001
From: Amit Kapila <akapila@postgresql.org>
Date: Tue, 8 Jan 2019 14:06:57 +0530
Subject: [PATCH] Provide interfaces to store and fetch undo records.

Add the capability to form undo records and store them in undo logs.  We
also provide the capability to fetch the undo records.  This layer will use
undo-log-storage to reserve the space for the undo records and buffer
management routines to write and read the undo records.

Undo records are stored in sequential order in the undo log.  Each undo
record consists of a variable length header, tuple data, and payload
information.  The undo records are stored without any sort of alignment
padding and an undo record can span across multiple pages.  The undo records
for a transaction can span across multiple undo logs.

Author: Dilip Kumar with contributions from Robert Haas, Amit Kapila,
	Thomas Munro and Rafia Sabih
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://www.postgresql.org/message-id/CAFiTN-uVxxopn0UZ64%3DF-sydbETBbGjWapnBikNo1%3DXv78UeFw%40mail.gmail.com
---
 src/backend/access/transam/xact.c    |   28 +
 src/backend/access/transam/xlog.c    |   30 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1243 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  464 +++++++++++++
 src/include/access/undoinsert.h      |   50 ++
 src/include/access/undorecord.h      |  197 ++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 9 files changed, 2016 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f665e38..806320b 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,26 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2657,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4842,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c4c5ab4..390ccba 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -8329,6 +8329,36 @@ GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch)
 }
 
 /*
+ * GetEpochForXid - get the epoch associated with the xid
+ */
+uint32
+GetEpochForXid(TransactionId xid)
+{
+	uint32		ckptXidEpoch;
+	TransactionId ckptXid;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	ckptXidEpoch = XLogCtl->ckptXidEpoch;
+	ckptXid = XLogCtl->ckptXid;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/*
+	 * Xid can be on either side when near wrap-around.  Xid is certainly
+	 * logically later than ckptXid.  So if it's numerically less, it must
+	 * have wrapped into the next epoch.  OTOH, if it is numerically more,
+	 * but logically lesser, then it belongs to previous epoch.
+	 */
+	if (xid > ckptXid &&
+		TransactionIdPrecedes(xid, ckptXid))
+		ckptXidEpoch--;
+	else if (xid < ckptXid &&
+			 TransactionIdFollows(xid, ckptXid))
+		ckptXidEpoch++;
+
+	return ckptXidEpoch;
+}
+
+/*
  * This must be called ONCE during postmaster or standalone-backend shutdown
  */
 void
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..b6c0491
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1243 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's undo in one shot rather
+ * than record-by-record.  The callers are not aware of the transaction
+ * header; it is entirely maintained and used by the undo record layer.  See
+ * undorecord.h for detailed information about the undo record header.
+ *
+ * Multiple logs:
+ *
+ * It is possible that the undo records for a transaction span across
+ * multiple undo logs.  We need some special handling while inserting them to
+ * ensure that discard and rollbacks can work sanely.
+ *
+ * When the undo record for a transaction gets inserted in the next log, we
+ * insert a transaction header for the first record in the new log and update
+ * our transaction header with this new log's location.  We also keep a back
+ * pointer to the last undo record of the previous log in the first record of
+ * the new log, so that we can traverse to the previous record during
+ * rollback.  In case this is not the first record in the new log (i.e. the
+ * new log already contains some other transaction's data), we also update
+ * that transaction's next-start header with this new undo record's location.
+ * This allows us to connect a transaction's undo records across logs when
+ * the same transaction spans across logs.
+ *
+ * There is some difference in the way rollbacks work when the undo for the
+ * same transaction spans across multiple logs, depending on which log is
+ * processed first by the discard worker.  If it processes the first log,
+ * which contains the transaction's first record, then it can get the last
+ * record of that transaction even if it is in a different log and then
+ * process all the undo records from last to first.  OTOH, if the next log
+ * gets processed first, we don't need to trace back the actual start pointer
+ * of the transaction; rather, we only execute the undo actions from the
+ * current log and avoid re-executing them next time.  It is possible that
+ * after executing the undo actions the undo gets discarded; later, while
+ * processing the previous log, we might then try to fetch an undo record in
+ * the discarded log while chasing the transaction header chain, which can
+ * cause trouble.  We avoid this situation by first checking whether the
+ * next_urec of the transaction is already discarded and, if so, starting
+ * execution from the last undo record in the current log.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "access/undolog_xlog.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size of more than BLCKSZ?  If not,
+ * then an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * This defines the max number of previous xact infos we need to update.
+ * Usually it's 1 for updating next link of previous transaction's header
+ * if we are starting a new transaction.  But, in some cases where the same
+ * transaction is spilled to the next log, we update our own transaction's
+ * header in previous undo log as well as the header of the previous
+ * transaction in the new log.
+ */
+#define MAX_XACT_UNDO_INFO	2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record as well.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + MAX_XACT_UNDO_INFO) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.
+ * In case the caller wants to support more than the default number of
+ * prepared undo records, the limit can be increased by calling the
+ * UndoSetPrepareSize function.  Therein, dynamic memory will be allocated
+ * and prepared_undo and undo_buffer will start pointing to the newly
+ * allocated memory, which will be released by UnlockReleaseUndoBuffers,
+ * and these variables will be set back to their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while current transaction is updating its undo record pointer
+ * in previous transactions first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info[MAX_XACT_UNDO_INFO];
+static int	xact_urec_info_idx;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   UndoRecPtr xact_urp);
+static void UndoRecordUpdateTransInfo(int idx);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record is discarded or not.  If it's already
+ * discarded, return false; otherwise, return true.
+ *
+ * Caller must hold log->discard_lock.  This function releases the lock if it
+ * returns false; otherwise, the lock is still held on return and the caller
+ * needs to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the DiscardWorker first
+		 * attempts to discard undo logs, so we cannot rely on this value to
+		 * identify whether the undo record pointer is already discarded;
+		 * instead, we check by calling the undo log routine.  If it's not yet
+		 * discarded, then we reacquire log->discard_lock so that it doesn't
+		 * get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, UndoRecPtr xact_urp)
+{
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case we have nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(xact_urp))
+		return;
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(xact_urp), false);
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker doesn't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * If the previous transaction's undo has already been discarded, we have
+	 * nothing to update.  UndoRecordIsValid will release the lock if it
+	 * returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the undo
+	 * record header is split across buffers, we need to read the complete
+	 * header by invoking UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info[xact_urec_info_idx].idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info[xact_urec_info_idx].uur, page,
+							 starting_byte, &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info[xact_urec_info_idx].uur.uur_next = urecptr;
+	xact_urec_info[xact_urec_info_idx].urecptr = xact_urp;
+	xact_urec_info_idx++;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This will just insert the already prepared record by
+ * UndoRecordPrepareTransInfo.  This must be called under the critical section.
+ * This will just overwrite the undo header not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info[idx].urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			i = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info[idx].urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * discard worker can't remove the record while we are in process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transactions start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info[idx].idx_undo_buffers[i];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info[idx].uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		i++;
+
+		Assert(i < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present, just
+ * return its index; otherwise, read the buffer, insert an entry, and lock
+ * the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because
+		 * undo_buffer might hold undo from different undo logs (e.g. when
+		 * the previous transaction's start header is in the previous undo
+		 * log), so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block so allocate the buffer and insert into the
+	 * undo buffer array
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate space for them
+ * in bulk.  This is required for operations which can allocate multiple undo
+ * records in one WAL operation, e.g. multi-insert.  If we don't allocate undo
+ * space for all the records (which are inserted under one WAL record)
+ * together, there is a possibility that they end up in different undo logs.
+ * And, currently during recovery, we don't have a mechanism to map an xid to
+ * multiple log numbers for one WAL operation.  So, in short, all the
+ * operations under one WAL record must allocate their undo from the same
+ * undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	UndoRecPtr	prevlogurp = InvalidUndoRecPtr;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 *
+		 * XXX There is also the option that, instead of adding the
+		 * information to this record, we could prepare a new record which
+		 * only contains transaction information, but we can't see any clear
+		 * advantage in that.
+		 * the same.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = GetEpochForXid(txid);
+			urec->uur_progress = 0;
+
+			if (log_switched)
+			{
+				/*
+				 * If undo log is switched then during rollback we can not go
+				 * to the previous undo record of the transaction by prevlen
+				 * so we store the previous undo record pointer in the
+				 * transaction header.
+				 */
+				Assert(UndoRecPtrIsValid(prevlogno));
+				log = UndoLogGet(prevlogno, false);
+				urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+												   log->meta.insert - log->meta.prevlen);
+			}
+			else
+				urec->uur_prevurp = InvalidUndoRecPtr;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values as
+			 * these are used only with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_xidepoch = 0;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+			urec->uur_prevurp = InvalidUndoRecPtr;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	/*
+	 * Check whether the undo log got switched while we are in a transaction.
+	 */
+	if (InRecovery)
+	{
+		/*
+		 * During recovery we can identify the log switch by checking the
+		 * prevlogurp from the MyUndoLogState.  The WAL replay action for log
+		 * switch would have set the value and we need to clear it after
+		 * retrieving the latest value.
+		 */
+		prevlogurp = UndoLogStateGetAndClearPrevLogXactUrp();
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+		if (UndoRecPtrIsValid(prevlogurp))
+		{
+			prevlogno = UndoRecPtrGetLogNo(prevlogurp);
+			log_switched = true;
+		}
+	}
+	else
+	{
+		/*
+		 * Check whether the current log is switched after allocation.  We can
+		 * determine that by simply checking to which log we are attached
+		 * before and after allocation.
+		 */
+		prevlogno = UndoLogAmAttachedTo(upersistence);
+		urecptr = UndoLogAllocate(size, upersistence);
+		if (!need_xact_hdr &&
+			prevlogno != InvalidUndoLogNumber &&
+			prevlogno != UndoRecPtrGetLogNo(urecptr))
+		{
+			log = UndoLogGet(prevlogno, false);
+			prevlogurp = MakeUndoRecPtr(prevlogno, log->meta.last_xact_start);
+			log_switched = true;
+		}
+	}
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space) or the undo log got switched, we'll
+	 * need a new transaction header. If we weren't already generating one,
+	 * then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start || log_switched))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/*
+		 * If the undo log is switched then we need to update our own
+		 * transaction header in the previous log as well as the previous
+		 * transaction's header in the new log.  Read detail comments for
+		 * multi-log handling atop this file.
+		 */
+		if (log_switched)
+			UndoRecordPrepareTransInfo(urecptr, prevlogurp);
+
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr,
+									   MakeUndoRecPtr(log->logno, log->meta.last_xact_start));
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	/*
+	 * Write WAL for log switch.  This is required to identify the log switch
+	 * during recovery.
+	 */
+	if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+	}
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set the value of how many undo records can be
+ * prepared before we can insert them.  If the size is greater than
+ * MAX_PREPARED_UNDO then it will allocate extra memory to hold the extra
+ * prepared undo.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Consider buffers needed for updating previous transaction's starting
+	 * undo record. Hence increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL, otherwise,
+ * it refers to the top transaction id because undo log only stores mapping
+ * for the top most transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because undo log only stores mapping
+		 * for the top most transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* advance the prepared ptr location for next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep the track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages they'll be all new so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+			uur->uur_prevlen = 0;
+
+		/*
+		 * if starting from a new page then consider block header size in
+		 * prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in the page.  We start writing immediately after the block
+			 * header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * succeed, then call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block then include the block
+			 * header in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* undo record can't use buffers more than MAX_BUFFER_PER_UNDO. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+
+	/* Update previously prepared transaction headers. */
+	if (xact_urec_info_idx > 0)
+	{
+		int			i = 0;
+
+		for (i = 0; i < xact_urec_info_idx; i++)
+			UndoRecordUpdateTransInfo(i);
+	}
+
+}
+
+/*
+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * to by urp and unpack the record into urec.  This function will not release
+ * the pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record which is on
+ * the same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from another
+ * block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This can be optimized to just fetch header first and only if
+		 * matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer, otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned then just ensure that we want to find
+	 * the next tuple from the same block.  Otherwise release the buffer and
+	 * set it invalid
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is not a valid buffer in urec->uur_buffer that means we
+		 * had copied the payload data and tuple data so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record, if a valid pointer is passed.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's conditions.
+ *
+ * Returns the required undo record if found; otherwise, returns NULL, which
+ * means either the record has already been discarded or there is no such
+ * record in the undo chain.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * A valid value of prevurp indicates that the previous undo record
+ * pointer is in some other log and caller can directly use that.
+ * Otherwise this will calculate the previous undo record pointer
+ * by using current urp and the prevlen.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp)
+{
+	if (UndoRecPtrIsValid(prevurp))
+		return prevurp;
+	else
+	{
+		UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+		UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+		/* calculate the previous undo record pointer */
+		return MakeUndoRecPtr(logno, offset - prevlen);
+	}
+}
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set LSN on undo page.
+ */
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required at
+ * transaction abort and while releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	for (i = 0; i < xact_urec_info_idx; i++)
+		xact_urec_info[i].urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	xact_urec_info_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepared_undo limit was changed, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..a13abe3
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,464 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
+ *
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_next = uur->uur_next;
+		work_txn.urec_xidepoch = uur->uur_xidepoch;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_prevurp = uur->uur_prevurp;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_prevurp == uur->uur_prevurp);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_splited = (my_bytes_decoded > 0) ? true : false;
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode header (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_xidepoch = work_txn.urec_xidepoch;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_prevurp = work_txn.urec_prevurp;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page, otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX There is a possible optimization: instead of always allocating
+		 * memory whenever the record is split, we could check whether the
+		 * payload or tuple data falls entirely within one page and, if so,
+		 * avoid allocating memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination, and 'readlen' is the number of bytes
+ * to read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we read.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true, we just skip over 'readlen' bytes of undo data
+ * without copying them into the destination.
+ *
+ * The return value is false if we ran out of space before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..41384e1
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undorecord satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+extern void UnlockReleaseUndoBuffers(void);
+
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp);
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..ebc638a
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,197 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed without any alignment padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than oldestXidWithEpochHavingUndo, then we can
+	 * consider the tuple in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the same order in which the constants are defined here.  That is,
+ * UndoRecordRelationDetails appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	UndoRecPtr	urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	/*
+	 * This indicates undo action apply progress: 0 means not started, 1 means
+	 * completed.  In the future, it could also be used to show how much undo
+	 * has been applied so far, computed with some formula.
+	 */
+	uint32		urec_progress;
+	uint32		urec_xidepoch;	/* epoch of the current transaction */
+	Oid			urec_dbid;		/* database id */
+
+	/*
+	 * Transaction's previous undo record pointer when a transaction spans
+	 * across undo logs.  The first undo record in the new log stores the
+	 * previous undo record pointer in the previous log as we can't calculate
+	 * that directly using prevlen during rollback.
+	 */
+	UndoRecPtr	urec_prevurp;
+	UndoRecPtr	urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction specific header information.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	UndoRecPtr	uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	uint32		uur_xidepoch;	/* epoch of the inserting transaction. */
+	UndoRecPtr	uur_prevurp;	/* urec pointer to the previous record in
+								 * the different log */
+	UndoRecPtr	uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/* undo applying progress, see detail comment in UndoRecordTransaction */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 169cf28..ddaa633 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a9..d18a1cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#33Dilip Kumar
dilipbalaut@gmail.com
In reply to: Amit Kapila (#32)
1 attachment(s)
Re: Undo logs

On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com> wrote:

Few more comments:
--------------------------------
Few comments:
----------------
1.
+ * undorecord.c
+ *   encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2018, PostgreSQL Global Development Group

Change the year in Copyright notice for all new files?

Done

2.
+ * This function sets uur->uur_info as a side effect.
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+ int starting_byte, int *already_written, bool header_only)

Is the above part of comment still correct? I don't see uur_info being set
here.

Changed

3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;

It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.

Fixed

4.
+ * 'my_bytes_written' is a pointer to the count of previous-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.  We must update it for the bytes we write.
+ *
..
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+ char **writeptr, char *endptr,
+ int *my_bytes_written, int *total_bytes_written)
+{
..
+
+ /* Update bookkeeeping infrormation. */
+ *writeptr += can_write;
+ *total_bytes_written += can_write;
+ *my_bytes_written = 0;

I don't understand the above comment where it is written: "We must
update it for the bytes we write.". We always set 'my_bytes_written'
as 0 if we write. Can you clarify? I guess this part of the comment
is about total_bytes_written, or does it mean that the caller should
update it? I think some wording change might be required based
on what we intend to say here.

Similar to above, there is a confusion in the description of
my_bytes_read atop ReadUndoBytes.

Fixed
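
For what it's worth, the bookkeeping being discussed is easier to see in a
small self-contained sketch (not taken from the patch; demo_insert_bytes,
DEMO_PAGE_SIZE and the rest are made-up names). It mimics the
InsertUndoBytes-style split write: *my_bytes_written says how much of the
current structure was already emitted on earlier pages, *total_written
accumulates everything written for the record so far, and the caller keeps
calling with fresh pages until the function reports completion:

/*
 * Hypothetical, self-contained illustration (not part of the patch) of the
 * split-write bookkeeping used by InsertUndoBytes.  A source buffer is
 * written across fixed-size "pages".
 */
#include <stdio.h>
#include <string.h>

#define DEMO_PAGE_SIZE 16

static int
demo_insert_bytes(const char *src, int srclen,
				  char **writeptr, char *endptr,
				  int *my_bytes_written, int *total_written)
{
	int			remaining;
	int			can_write;

	/* This structure was fully written on earlier pages: consume the count. */
	if (*my_bytes_written >= srclen)
	{
		*my_bytes_written -= srclen;
		return 1;
	}

	remaining = srclen - *my_bytes_written;
	can_write = (int) (endptr - *writeptr);
	if (can_write > remaining)
		can_write = remaining;
	if (can_write == 0)
		return 0;

	memcpy(*writeptr, src + *my_bytes_written, can_write);
	*writeptr += can_write;
	*total_written += can_write;
	*my_bytes_written = 0;		/* later structures start from scratch */

	return can_write == remaining;
}

int
main(void)
{
	const char	payload[] = "0123456789ABCDEFGHIJKLMNOP";	/* longer than a page */
	char		page[DEMO_PAGE_SIZE];
	int			total_written = 0;
	int			done = 0;

	while (!done)
	{
		char	   *writeptr = page;	/* a fresh page each time around */
		int			my_bytes_written = total_written;	/* re-seed from total */

		done = demo_insert_bytes(payload, (int) strlen(payload),
								 &writeptr, page + DEMO_PAGE_SIZE,
								 &my_bytes_written, &total_written);
		printf("wrote %d of %d bytes so far\n",
			   total_written, (int) strlen(payload));
	}
	return 0;
}

The point of resetting *my_bytes_written to 0 after a write is that any
following structure in the same record starts from scratch, while the caller
re-seeds it from the running total when it moves to the next page.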

5.
+uint32
+GetEpochForXid(TransactionId xid)
{
..
+ /*
+ * Xid can be on either side when near wrap-around.  Xid is certainly
+ * logically later than ckptXid.
..

From the usage of this function in the patch, can we say that Xid is
always later than ckptxid, if so, how? Also, I think you previously
told in this thread that usage of uur_xidepoch is mainly for zheap, so
we might want to postpone including it in undo records. On
thinking again, I think we should follow your advice as I think the
correct usage here would require the patch by Thomas to fix our epoch
stuff [1]. Am I correct? If so, I think we should postpone it for
now.

Removed

6.
/*
+ * SetCurrentUndoLocation
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
{
..
}

I think you can use some comments atop this function to explain the
usage of this function or how will callers use it.

Done

I am done with the first level of code-review for this patch. I am
sure we might need few interface changes here and there while
integrating and testing this with other patches, but the basic idea
and code look reasonable to me. I have modified the proposed commit
message in the attached patch, see if that looks fine to you.

To be clear, this patch can't be independently committed/tested, we
need undo logs and undo worker machinery patches to be ready as well.
I will review those next.

Make sense

[1] -
/messages/by-id/CAEepm=2YYAtkSnens=TR2S=oRcAF9=2P7GPMK0wMJtxKF1QRig@mail.gmail.com

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-Provide-interfaces-to-store-and-fetch-undo-records_v15.patchapplication/octet-stream; name=0003-Provide-interfaces-to-store-and-fetch-undo-records_v15.patchDownload
From 7db2c2d9b6ff9bb96bc2979b3078cc49d95fc182 Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 5 Jan 2019 10:43:18 +0530
Subject: [PATCH] Provide interfaces to store and fetch undo records.

Add the capability to form undo records and store them in undo logs.  We
also provide the capability to fetch the undo records.  This layer will use
undo-log-storage to reserve the space for the undo records and buffer
management routines to write and read the undo records.

Undo records are stored in sequential order in the undo log.  Each undo
record consists of a variable length header, tuple data, and payload
information.  The undo records are stored without any sort of alignment
padding and an undo record can span across multiple pages.  The undo records
for a transaction can span across multiple undo logs.

Author: Dilip Kumar with contributions from Robert Haas, Amit Kapila,
	Thomas Munro and Rafia Sabih
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://www.postgresql.org/message-id/CAFiTN-uVxxopn0UZ64%3DF-sydbETBbGjWapnBikNo1%3DXv78UeFw%40mail.gmail.com
---
 src/backend/access/transam/xact.c    |   37 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1241 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  460 +++++++++++++
 src/include/access/undoinsert.h      |   50 ++
 src/include/access/undorecord.h      |  195 ++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 8 files changed, 1987 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f665e38..8af75a9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	 /* start and end undo record location for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,35 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ * 
+ * Update the start and the latest undo record pointer for the transaction.
+ * 
+ * start_urec_ptr is set only for the transaction's first undo record, i.e.
+ * while start_urec_ptr is still invalid.  latest_urec_ptr is updated whenever
+ * a new undo record is inserted for the transaction.
+ *
+ * The start and latest undo record pointers are tracked separately for each
+ * persistence level.
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2666,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4851,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..9ffa46c
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1241 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's log at one-shot rather
+ * than record-by-record.  The callers are not aware of transaction header,
+ * this is entirely maintained and used by undo record layer.   See
+ * undorecord.h for detailed information about undo record header.
+ *
+ * Multiple logs:
+ *
+ * It is possible for the undo records of a transaction to span multiple undo
+ * logs.  We need some special handling while inserting them to ensure that
+ * discard and rollbacks can work sanely.
+ *
+ * When a transaction's undo record gets inserted in the next log, we insert a
+ * transaction header for the first record in the new log and update the
+ * transaction header with this new log's location.  We also keep a back
+ * pointer to the last undo record of the previous log in the first record of
+ * the new log, so that we can traverse to the previous record during
+ * rollback.  In case this is not the first record in the new log (i.e. the
+ * new log already contains some other transaction's data), we also update
+ * that transaction's next-start header with this new undo record's location.
+ * This allows us to connect a transaction's undo records across logs when the
+ * same transaction spans multiple logs.
+ *
+ * Rollbacks work somewhat differently when the undo for the same transaction
+ * spans multiple logs, depending on which log is processed first by the
+ * discard worker.  If it processes the log which contains the transaction's
+ * first record first, then it can get the last record of that transaction
+ * even if it is in a different log and then process all the undo records from
+ * last to first.  OTOH, if the next log gets processed first, we don't need
+ * to trace back the actual start pointer of the transaction; rather we only
+ * execute the undo actions from the current log and avoid re-executing them
+ * next time.  There is a possibility that after executing the undo actions
+ * the undo gets discarded; later, while processing the previous log, we might
+ * try to fetch an undo record in the discarded log while chasing the
+ * transaction header chain, which can cause trouble.  We avoid this situation
+ * by first checking whether the next_urec of the transaction is already
+ * discarded and, if so, starting execution from the last undo record in the
+ * current log.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "access/undolog_xlog.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size larger than BLCKSZ?  If not,
+ * an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * This defines the max number of previous xact infos we need to update.
+ * Usually it's 1 for updating next link of previous transaction's header
+ * if we are starting a new transaction.  But, in some cases where the same
+ * transaction is spilled to the next log, we update our own transaction's
+ * header in previous undo log as well as the header of the previous
+ * transaction in the new log.
+ */
+#define MAX_XACT_UNDO_INFO	2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record as well.
+ */
+#define MAX_UNDO_BUFFERS       (MAX_PREPARED_UNDO + MAX_XACT_UNDO_INFO) * MAX_BUFFER_PER_UNDO
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record.
+ */
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+
+/*
+ * By default prepared_undo and undo_buffer point to static memory.  If the
+ * caller wants to prepare more than the default number of undo records, the
+ * limit can be increased by calling UndoSetPrepareSize.  In that case,
+ * dynamic memory is allocated and prepared_undo and undo_buffer start
+ * pointing to the newly allocated memory, which is released by
+ * UnlockReleaseUndoBuffers, at which point these variables are set back to
+ * their default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating its undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info[MAX_XACT_UNDO_INFO];
+static int	xact_urec_info_idx;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   UndoRecPtr xact_urp);
+static void UndoRecordUpdateTransInfo(int idx);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record has been discarded.  If it has already been
+ * discarded return false, otherwise return true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock if
+ * it returns false; otherwise the lock is still held on return and the caller
+ * needs to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the discard worker first
+		 * attempts to discard undo logs, so we cannot rely on this value to
+		 * determine whether the undo record pointer has already been
+		 * discarded; instead we check by calling the undo log routine.  If it
+		 * is not yet discarded, we have to reacquire log->discard_lock so
+		 * that the undo doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, UndoRecPtr xact_urp)
+{
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case we have nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(xact_urp))
+		return;
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(xact_urp), false);
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker doesn't remove the record while we are in the
+	 * process of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * If the previous transaction's undo has already been discarded then we
+	 * have nothing to update.
+	 * UndoRecordIsValid will release the lock if it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord.  If the undo
+	 * record header is split across buffers then we need to read the complete
+	 * header by invoking UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info[xact_urec_info_idx].idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info[xact_urec_info_idx].uur, page,
+							 starting_byte, &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info[xact_urec_info_idx].uur.uur_next = urecptr;
+	xact_urec_info[xact_urec_info_idx].urecptr = xact_urp;
+	xact_urec_info_idx++;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just inserts the record already prepared by
+ * UndoRecordPrepareTransInfo.  This must be called within a critical section.
+ * Only the undo header is overwritten, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info[idx].urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			i = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info[idx].urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that
+	 * the discard worker can't remove the record while we are in the process
+	 * of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info[idx].idx_undo_buffers[i];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info[idx].uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		i++;
+
+		Assert(idx < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array.  If it's present then just
+ * return its index; otherwise read the buffer, insert an entry, and lock the
+ * buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because the
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare (logno + blkno).
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert an entry into
+	 * the undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords and allocate space for them
+ * in bulk.  This is required for operations that can allocate multiple undo
+ * records in one WAL operation, e.g. multi-insert.  If we don't allocate undo
+ * space for all the records (which are inserted under one WAL record)
+ * together, there is a possibility that they end up in different undo logs.
+ * And currently, during recovery, we have no mechanism to map an xid to
+ * multiple log numbers within one WAL operation.  So, in short, all the
+ * records under one WAL record must allocate their undo from the same undo
+ * log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	UndoRecPtr	prevlogurp = InvalidUndoRecPtr;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 *
+		 * XXX There is also the option that instead of adding this
+		 * information to this record we could prepare a new record that
+		 * contains only transaction information, but we don't see any clear
+		 * advantage in that.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_progress = 0;
+
+			if (log_switched)
+			{
+				/*
+				 * If the undo log was switched then during rollback we cannot
+				 * reach the previous undo record of the transaction via
+				 * prevlen, so we store the previous undo record pointer in
+				 * the transaction header.
+				 */
+				Assert(UndoRecPtrIsValid(prevlogno));
+				log = UndoLogGet(prevlogno, false);
+				urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+												   log->meta.insert - log->meta.prevlen);
+			}
+			else
+				urec->uur_prevurp = InvalidUndoRecPtr;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values as
+			 * these are used only with the first record of transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+			urec->uur_prevurp = InvalidUndoRecPtr;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	/*
+	 * Check whether the undo log got switched while we are in a transaction.
+	 */
+	if (InRecovery)
+	{
+		/*
+		 * During recovery we can identify the log switch by checking the
+		 * prevlogurp from the MyUndoLogState.  The WAL replay action for log
+		 * switch would have set the value and we need to clear it after
+		 * retrieving the latest value.
+		 */
+		prevlogurp = UndoLogStateGetAndClearPrevLogXactUrp();
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+		if (UndoRecPtrIsValid(prevlogurp))
+		{
+			prevlogno = UndoRecPtrGetLogNo(prevlogurp);
+			log_switched = true;
+		}
+	}
+	else
+	{
+		/*
+		 * Check whether the current log got switched during allocation.  We
+		 * can determine that by simply checking to which log we are attached
+		 * before and after allocation.
+		 */
+		prevlogno = UndoLogAmAttachedTo(upersistence);
+		urecptr = UndoLogAllocate(size, upersistence);
+		if (!need_xact_hdr &&
+			prevlogno != InvalidUndoLogNumber &&
+			prevlogno != UndoRecPtrGetLogNo(urecptr))
+		{
+			log = UndoLogGet(prevlogno, false);
+			prevlogurp = MakeUndoRecPtr(prevlogno, log->meta.last_xact_start);
+			log_switched = true;
+		}
+	}
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space) or the undo log got switched, we'll
+	 * need a new transaction header. If we weren't already generating one,
+	 * then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start || log_switched))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/*
+		 * If the undo log is switched then we need to update our own
+		 * transaction header in the previous log as well as the previous
+		 * transaction's header in the new log.  See the detailed comments
+		 * about multi-log handling atop this file.
+		 */
+		if (log_switched)
+			UndoRecordPrepareTransInfo(urecptr, prevlogurp);
+
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr,
+									   MakeUndoRecPtr(log->logno, log->meta.last_xact_start));
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	/*
+	 * Write WAL for log switch.  This is required to identify the log switch
+	 * during recovery.
+	 */
+	if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+	}
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set how many undo records can be prepared
+ * before we insert them.  If the size is greater than
+ * MAX_PREPARED_UNDO then it will allocate extra memory to hold the extra
+ * prepared undo.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Also consider the buffers needed for updating the previous
+	 * transaction's starting undo record; hence we allocate for nrecords + 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
+
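+/*
+ * A rough sketch of the multi-record path (the caller-side variables
+ * 'undorecords', 'nrecords' and 'undo_ptr' are hypothetical): a caller that
+ * wants to insert several undo records under a single WAL record raises the
+ * prepared-undo limit first and then prepares each record:
+ *
+ *		UndoSetPrepareSize(undorecords, nrecords, InvalidTransactionId,
+ *						   UNDO_PERMANENT);
+ *		for (i = 0; i < nrecords; i++)
+ *			undo_ptr[i] = PrepareUndoInsert(&undorecords[i],
+ *											InvalidTransactionId,
+ *											UNDO_PERMANENT);
+ *
+ * Passing InvalidTransactionId makes the undo layer use the top transaction
+ * id; during recovery the xid recorded in WAL must be passed instead.
+ */
+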
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL; otherwise,
+ * it refers to the top transaction id because the undo log only stores
+ * mappings for topmost transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id because the undo log only stores
+		 * mappings for topmost transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* Advance the prepared pointer location for the next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages they'll all be new, so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked by PrepareUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+			uur->uur_prevlen = 0;
+
+		/*
+		 * If starting from a new page then include the block header size in
+		 * the prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we try to write the first record
+			 * in a page.  We start writing immediately after the block header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page. If it doesn't
+			 * succeed then call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block then include the header
+			 * in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback during abort of transaction.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+
+	/* Update previously prepared transaction headers. */
+	if (xact_urec_info_idx > 0)
+	{
+		int			i = 0;
+
+		for (i = 0; i < xact_urec_info_idx; i++)
+			UndoRecordUpdateTransInfo(i);
+	}
+
+}
+
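+/*
+ * A simplified usage sketch (the caller-side variables and the rmid/info
+ * values are illustrative only): the caller fills in an UnpackedUndoRecord,
+ * prepares it outside the critical section, and then inserts it together
+ * with its WAL record inside the critical section:
+ *
+ *		UnpackedUndoRecord undorecord = {0};
+ *		UndoRecPtr	urecptr;
+ *		XLogRecPtr	recptr;
+ *
+ *		undorecord.uur_type = <AM-specific opcode>;
+ *		undorecord.uur_info = 0;
+ *		undorecord.uur_reloid = reloid;
+ *		undorecord.uur_xid = GetTopTransactionId();
+ *		undorecord.uur_fork = MAIN_FORKNUM;
+ *		undorecord.uur_block = blkno;
+ *		undorecord.uur_offset = offnum;
+ *
+ *		urecptr = PrepareUndoInsert(&undorecord, InvalidTransactionId,
+ *									UNDO_PERMANENT);
+ *
+ *		START_CRIT_SECTION();
+ *		InsertPreparedUndo();
+ *		XLogBeginInsert();
+ *		... XLogRegisterBuffer() the caller's own pages from block 0 ...
+ *		RegisterUndoLogBuffers(1);
+ *		recptr = XLogInsert(rmid, info);
+ *		UndoLogBuffersSetLSN(recptr);
+ *		END_CRIT_SECTION();
+ *
+ *		UnlockReleaseUndoBuffers();
+ *
+ * PrepareUndoInsert is done before the critical section because it can fail;
+ * InsertPreparedUndo never should.
+ */
+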
+/*
+ * Helper function for UndoFetchRecord.  It will fetch the undo record pointed
+ * to by urp and unpack the record into urec.  This function will not release
+ * the pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record which is on
+ * the same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from another
+ * block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This can be optimized to just fetch header first and only if
+		 * matches with block number and offset then fetch the complete
+		 * record.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data then release the buffer, otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned then keep it only if we want to find
+	 * the next tuple in the same block.  Otherwise release the buffer and
+	 * set it invalid.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is no valid buffer in urec->uur_buffer, that means we
+		 * copied the payload data and tuple data, so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for given blkno, offset and transaction id (if
+ * valid).  The same tuple can be modified by multiple transactions, so during
+ * undo chain traversal sometimes we need to distinguish based on transaction
+ * id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * If a valid pointer is passed for urec_ptr_out, it is set to the undo record
+ * pointer of the qualifying undo record.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's condition.
+ *
+ * Returns the required undo record if found; otherwise, returns NULL, which
+ * means either the record has already been discarded or there is no such
+ * record in the undo chain.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undo record satisfies the conditions. */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
+
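+/*
+ * As an illustration (the callback shown here is hypothetical), a caller
+ * chasing the undo chain for a particular heap tuple might do:
+ *
+ *		static bool
+ *		my_undo_satisfies(UnpackedUndoRecord *urec, BlockNumber blkno,
+ *						  OffsetNumber offset, TransactionId xid)
+ *		{
+ *			return urec->uur_block == blkno && urec->uur_offset == offset;
+ *		}
+ *
+ *		urec = UndoFetchRecord(urp, blkno, offnum, InvalidTransactionId,
+ *							   &urec_ptr, my_undo_satisfies);
+ *		if (urec != NULL)
+ *		{
+ *			... inspect urec->uur_payload and urec->uur_tuple ...
+ *			UndoRecordRelease(urec);
+ *		}
+ */
+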
+/*
+ * Return the previous undo record pointer.
+ *
+ * A valid value of prevurp indicates that the previous undo record
+ * pointer is in some other log and the caller can use it directly.
+ * Otherwise this will calculate the previous undo record pointer
+ * using the current urp and the prevlen.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp)
+{
+	if (UndoRecPtrIsValid(prevurp))
+		return prevurp;
+	else
+	{
+		UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+		UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+		/* calculate the previous undo record pointer */
+		return MakeUndoRecPtr(logno, offset - prevlen);
+	}
+}
+
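+/*
+ * For example, after fetching a record with UndoFetchRecord, a caller can
+ * step back to the transaction's previous record like this (sketch only):
+ *
+ *		prev_urp = UndoGetPrevUndoRecptr(urec_ptr, urec->uur_prevlen,
+ *										 urec->uur_prevurp);
+ *
+ * uur_prevurp is only valid when the transaction switched undo logs, in
+ * which case it is used directly; otherwise the previous pointer is derived
+ * from prevlen.
+ */
+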
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer then just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set LSN on undo page.
+ */
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required at
+ * transaction abort and while releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	for (i = 0; i < xact_urec_info_idx; i++)
+		xact_urec_info[i].urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	xact_urec_info_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepared_undo limit was raised, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section where we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..17adff7
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,460 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_txn.urec_prevurp = uur->uur_prevurp;
+		work_txn.urec_next = uur->uur_next;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_txn.urec_prevurp == uur->uur_prevurp);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
+
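+/*
+ * The intended calling pattern, sketched with hypothetical buffer-handling
+ * placeholders (InsertPreparedUndo in undoinsert.c is the real caller):
+ *
+ *		already_written = 0;
+ *		starting_byte = <offset of the record within the first page>;
+ *		for (;;)
+ *		{
+ *			page = BufferGetPage(buffer);
+ *			if (InsertUndoRecord(uur, page, starting_byte,
+ *								 &already_written, false))
+ *				break;
+ *			buffer = <next pinned undo buffer>;
+ *			starting_byte = SizeOfPageHeaderData;
+ *		}
+ *
+ * The loop exits as soon as InsertUndoRecord reports that the whole record
+ * has been written; otherwise the remainder continues on the next page,
+ * immediately after its page header.
+ */
+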
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and it must be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_split = (my_bytes_decoded > 0);
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode relation details (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_next = work_txn.urec_next;
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_prevurp = work_txn.urec_prevurp;
+		uur->uur_dbid = work_txn.urec_dbid;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page then just
+		 * point the payload data and tuple data into the page; otherwise
+		 * allocate memory for them.
+		 *
+		 * XXX There is a possible optimization: instead of always allocating
+		 * memory whenever the record is split, we could check whether the
+		 * payload or tuple data still falls entirely within this page and, if
+		 * so, avoid allocating memory for that part.
+		 */
+		if (!is_undo_split &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
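+/*
+ * The calling pattern mirrors InsertUndoRecord; a sketch of a reader that
+ * walks page by page (UndoGetOneRecord in undoinsert.c is the real caller):
+ *
+ *		already_decoded = 0;
+ *		starting_byte = <offset of the record within the first page>;
+ *		while (!UnpackUndoRecord(uur, page, starting_byte,
+ *								 &already_decoded, false))
+ *		{
+ *			page = <read and lock the next undo page>;
+ *			starting_byte = SizeOfPageHeaderData;
+ *		}
+ */
+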
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to be read.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is set to true then we just skip over 'readlen' bytes of undo
+ * data without copying them into the destination buffer.
+ *
+ * The return value is false if we ran out of data on this page before
+ * reading all the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..c333f00
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+extern void UnlockReleaseUndoBuffers(void);
+
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp);
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..0dcf1b1
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed into the alignment without padding
+ * bytes, and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than oldestXidWithEpochHavingUndo, then we can
+	 * consider the tuple in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in the order listed above.  That is, UndoRecordRelationDetails
+ * appears first.
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
+
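+/*
+ * For example, a record with urec_info = UREC_INFO_BLOCK | UREC_INFO_PAYLOAD
+ * is laid out as an UndoRecordHeader, then an UndoRecordBlock, then an
+ * UndoRecordPayload, then the payload bytes, then the tuple bytes; for such
+ * a record UndoRecordExpectedSize returns
+ *
+ *		SizeOfUndoRecordHeader + SizeOfUndoRecordBlock +
+ *		SizeOfUndoRecordPayload + uur_payload.len + uur_tuple.len
+ */
+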
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	UndoRecPtr	urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	/*
+	 * This indicates undo action apply progress: 0 means not started, 1 means
+	 * completed.  In the future, it could also be used to show how much undo
+	 * has been applied so far, using some formula.
+	 */
+	uint32		urec_progress;
+	Oid			urec_dbid;		/* database id */
+
+	/*
+	 * Transaction's previous undo record pointer when a transaction spans
+	 * multiple undo logs.  The first undo record in the new log stores the
+	 * pointer to the previous undo record in the previous log, as we can't
+	 * calculate that directly using prevlen during rollback.
+	 */
+	UndoRecPtr	urec_prevurp;
+	UndoRecPtr	urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to manage.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction specific header information.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	UndoRecPtr	uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	UndoRecPtr	uur_prevurp;	/* urec pointer to the previous record in
+								 * the different log */
+	UndoRecPtr	uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/* undo applying progress, see detail comment in UndoRecordTransaction */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 169cf28..ddaa633 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a9..d18a1cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1

#34Dilip Kumar
dilipbalaut@gmail.com
In reply to: Dilip Kumar (#33)
1 attachment(s)
Re: Undo logs

On Wed, Jan 9, 2019 at 11:40 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Tue, Jan 8, 2019 at 2:11 PM Amit Kapila <amit.kapila16@gmail.com>
wrote:

3.
+ work_txn.urec_next = uur->uur_next;
+ work_txn.urec_xidepoch = uur->uur_xidepoch;
+ work_txn.urec_progress = uur->uur_progress;
+ work_txn.urec_prevurp = uur->uur_prevurp;
+ work_txn.urec_dbid = uur->uur_dbid;

It would be better if we initialize these members in the order in
which they appear in the actual structure. All other undo header
structures are initialized that way, so this looks out-of-place.

One more change along the same lines in ReadUndoBytes.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachments:

0003-Provide-interfaces-to-store-and-fetch-undo-records_v16.patchapplication/octet-stream; name=0003-Provide-interfaces-to-store-and-fetch-undo-records_v16.patchDownload
From 3d7ca1049519a6551fbaa1fc5262ef1a56ac35fa Mon Sep 17 00:00:00 2001
From: Dilip Kumar <dilipkumar@localhost.localdomain>
Date: Sat, 5 Jan 2019 10:43:18 +0530
Subject: [PATCH] Provide interfaces to store and fetch undo records.

Add the capability to form undo records and store them in undo logs.  We
also provide the capability to fetch the undo records.  This layer will use
undo-log-storage to reserve the space for the undo records and buffer
management routines to write and read the undo records.

Undo records are stored in sequential order in the undo log.  Each undo
record consists of a variable length header, tuple data, and payload
information.  The undo records are stored without any sort of alignment
padding and an undo record can span multiple pages.  The undo records
for a transaction can span across multiple undo logs.

Author: Dilip Kumar with contributions from Robert Haas, Amit Kapila,
	Thomas Munro and Rafia Sabih
Reviewed-by: Amit Kapila
Tested-by: Neha Sharma
Discussion: https://www.postgresql.org/message-id/CAFiTN-uVxxopn0UZ64%3DF-sydbETBbGjWapnBikNo1%3DXv78UeFw%40mail.gmail.com
---
 src/backend/access/transam/xact.c    |   37 +
 src/backend/access/undo/Makefile     |    2 +-
 src/backend/access/undo/undoinsert.c | 1241 ++++++++++++++++++++++++++++++++++
 src/backend/access/undo/undorecord.c |  460 +++++++++++++
 src/include/access/undoinsert.h      |   50 ++
 src/include/access/undorecord.h      |  195 ++++++
 src/include/access/xact.h            |    2 +
 src/include/access/xlog.h            |    1 +
 8 files changed, 1987 insertions(+), 1 deletion(-)
 create mode 100644 src/backend/access/undo/undoinsert.c
 create mode 100644 src/backend/access/undo/undorecord.c
 create mode 100644 src/include/access/undoinsert.h
 create mode 100644 src/include/access/undorecord.h

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index f665e38..8af75a9 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -30,6 +30,7 @@
 #include "access/xlog.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
+#include "access/undoinsert.h"
 #include "catalog/namespace.h"
 #include "catalog/pg_enum.h"
 #include "catalog/storage.h"
@@ -66,6 +67,7 @@
 #include "utils/timestamp.h"
 #include "pg_trace.h"
 
+#define	AtAbort_ResetUndoBuffers() ResetUndoBuffers()
 
 /*
  *	User-tweakable parameters
@@ -189,6 +191,10 @@ typedef struct TransactionStateData
 	bool		startedInRecovery;	/* did we start in recovery? */
 	bool		didLogXid;		/* has xid been included in WAL record? */
 	int			parallelModeLevel;	/* Enter/ExitParallelMode counter */
+
+	/* start and latest undo record locations for each persistence level */
+	UndoRecPtr	start_urec_ptr[UndoPersistenceLevels];
+	UndoRecPtr	latest_urec_ptr[UndoPersistenceLevels];
 	struct TransactionStateData *parent;	/* back link to parent */
 } TransactionStateData;
 
@@ -912,6 +918,35 @@ IsInParallelMode(void)
 }
 
 /*
+ * SetCurrentUndoLocation
+ * 
+ * Update the start and the latest undo record pointer for the transaction.
+ * 
+ * start_urec_ptr is set only for the first undo for the transaction i.e.
+ * start_urec_ptr is invalid.  Update the latest_urec_ptr whenever a new
+ * undo is inserted for the transaction.
+ * 
+ * start and latest undo record pointer are tracked separately for each
+ * persistent level.
+ */
+void
+SetCurrentUndoLocation(UndoRecPtr urec_ptr)
+{
+	UndoLogControl *log = UndoLogGet(UndoRecPtrGetLogNo(urec_ptr), false);
+	UndoPersistence upersistence = log->meta.persistence;
+
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+	/*
+	 * Set the start undo record pointer for first undo record in a
+	 * subtransaction.
+	 */
+	if (!UndoRecPtrIsValid(CurrentTransactionState->start_urec_ptr[upersistence]))
+		CurrentTransactionState->start_urec_ptr[upersistence] = urec_ptr;
+	CurrentTransactionState->latest_urec_ptr[upersistence] = urec_ptr;
+
+}
+
+/*
  *	CommandCounterIncrement
  */
 void
@@ -2631,6 +2666,7 @@ AbortTransaction(void)
 		AtEOXact_HashTables(false);
 		AtEOXact_PgStat(false);
 		AtEOXact_ApplyLauncher(false);
+		AtAbort_ResetUndoBuffers();
 		pgstat_report_xact_timestamp(0);
 	}
 
@@ -4815,6 +4851,7 @@ AbortSubTransaction(void)
 		AtEOSubXact_PgStat(false, s->nestingLevel);
 		AtSubAbort_Snapshot(s->nestingLevel);
 		AtEOSubXact_ApplyLauncher(false, s->nestingLevel);
+		AtAbort_ResetUndoBuffers();
 	}
 
 	/*
diff --git a/src/backend/access/undo/Makefile b/src/backend/access/undo/Makefile
index 219c696..f41e8f7 100644
--- a/src/backend/access/undo/Makefile
+++ b/src/backend/access/undo/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/access/undo
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = undolog.o
+OBJS = undoinsert.o undolog.o undorecord.o
 
 include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/access/undo/undoinsert.c b/src/backend/access/undo/undoinsert.c
new file mode 100644
index 0000000..9ffa46c
--- /dev/null
+++ b/src/backend/access/undo/undoinsert.c
@@ -0,0 +1,1241 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.c
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undoinsert.c
+ *
+ * NOTES:
+ * Undo record layout:
+ *
+ * Undo records are stored in sequential order in the undo log.  Each undo
+ * record consists of a variable length header, tuple data, and payload
+ * information.  The first undo record of each transaction contains a
+ * transaction header that points to the next transaction's start header.
+ * This allows us to discard the entire transaction's log at one-shot rather
+ * than record-by-record.  The callers are not aware of transaction header,
+ * this is entirely maintained and used by undo record layer.   See
+ * undorecord.h for detailed information about undo record header.
+ *
+ * Multiple logs:
+ *
+ * It is possible that the undo records for a transaction span multiple undo
+ * logs.  We need some special handling while inserting them to
+ * ensure that discard and rollbacks can work sanely.
+ *
+ * When the undo record for a transaction gets inserted into the next log, we
+ * insert a transaction header for the first record in the new log and update
+ * the transaction header with this new log's location.  We also keep a back
+ * pointer to the last undo record of the previous log in the first record of
+ * the new log, so that we can traverse to the previous record during
+ * rollback.  In case this is not the first record in the new log (i.e. the
+ * new log already contains some other transaction's data), we also update
+ * that transaction's next start header with this new undo record's location.
+ * This allows us to connect a transaction's undo records across logs when
+ * the same transaction spans multiple logs.
+ *
+ * Rollbacks work somewhat differently when the undo for the same transaction
+ * spans multiple logs, depending on which log is processed first by the
+ * discard worker.  If it processes the first log, which contains the
+ * transaction's first record, then it can reach the last record of that
+ * transaction even if it is in a different log, and then process all the
+ * undo records from last to first.  OTOH, if the next log gets processed
+ * first, we don't need to trace back to the actual start pointer of the
+ * transaction; rather, we only execute the undo actions from the current log
+ * and avoid re-executing them later.  There is a possibility that after
+ * executing the undo actions, the undo gets discarded; then at a later stage,
+ * while processing the previous log, we might try to fetch an undo record in
+ * the discarded log while chasing the transaction header chain, which can
+ * cause trouble.  We avoid this situation by first checking whether the
+ * next_urec of the transaction is already discarded and, if so, starting
+ * execution from the last undo record in the current log.
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/transam.h"
+#include "access/xact.h"
+#include "access/xlog.h"
+#include "access/undorecord.h"
+#include "access/undoinsert.h"
+#include "access/undolog_xlog.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+#include "storage/buf.h"
+#include "storage/buf_internals.h"
+#include "storage/bufmgr.h"
+#include "miscadmin.h"
+#include "commands/tablecmds.h"
+
+/*
+ * XXX Do we want to support an undo tuple size larger than BLCKSZ?  If not,
+ * an undo record can spread across 2 buffers at the most.
+ */
+#define MAX_BUFFER_PER_UNDO    2
+
+/*
+ * This defines the number of undo records that can be prepared before
+ * calling insert by default.  If you need to prepare more than
+ * MAX_PREPARED_UNDO undo records, then you must call UndoSetPrepareSize
+ * first.
+ */
+#define MAX_PREPARED_UNDO 2
+
+/*
+ * This defines the max number of previous xact infos we need to update.
+ * Usually it's 1, for updating the next link of the previous transaction's
+ * header when we are starting a new transaction.  But, in some cases where
+ * the same transaction is spilled to the next log, we update our own
+ * transaction's header in the previous undo log as well as the header of the
+ * previous transaction in the new log.
+ */
+#define MAX_XACT_UNDO_INFO	2
+
+/*
+ * Consider buffers needed for updating previous transaction's
+ * starting undo record as well.
+ */
+#define MAX_UNDO_BUFFERS       ((MAX_PREPARED_UNDO + MAX_XACT_UNDO_INFO) * MAX_BUFFER_PER_UNDO)
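+
+/*
+ * For illustration only: with the defaults above this works out to
+ * (2 + 2) * 2 = 8 buffer slots, i.e. enough for two prepared undo records
+ * plus two transaction-header updates, each of which may touch at most two
+ * buffers.
+ */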
+
+/*
+ * Previous top transaction id which inserted the undo.  Whenever a new main
+ * transaction tries to prepare an undo record, we check whether its txid is
+ * the same as prev_txid; if not, we insert the start undo record (i.e. one
+ * carrying a transaction header).
+static TransactionId prev_txid[UndoPersistenceLevels] = {0};
+
+/* Undo block number to buffer mapping. */
+typedef struct UndoBuffers
+{
+	UndoLogNumber logno;		/* Undo log number */
+	BlockNumber blk;			/* block number */
+	Buffer		buf;			/* buffer allocated for the block */
+	bool		zero;			/* new block full of zeroes */
+} UndoBuffers;
+
+static UndoBuffers def_buffers[MAX_UNDO_BUFFERS];
+static int	buffer_idx;
+
+/*
+ * Structure to hold the prepared undo information.
+ */
+typedef struct PreparedUndoSpace
+{
+	UndoRecPtr	urp;			/* undo record pointer */
+	UnpackedUndoRecord *urec;	/* undo record */
+	int			undo_buffer_idx[MAX_BUFFER_PER_UNDO];	/* undo_buffer array
+														 * index */
+} PreparedUndoSpace;
+
+static PreparedUndoSpace def_prepared[MAX_PREPARED_UNDO];
+static int	prepare_idx;
+static int	max_prepared_undo = MAX_PREPARED_UNDO;
+static UndoRecPtr prepared_urec_ptr = InvalidUndoRecPtr;
+
+/*
+ * By default prepared_undo and undo_buffer points to the static memory.
+ * In case caller wants to support more than default max_prepared undo records
+ * then the limit can be increased by calling UndoSetPrepareSize function.
+ * Therein, dynamic memory will be allocated and prepared_undo and undo_buffer
+ * will start pointing to newly allocated memory, which will be released by
+ * UnlockReleaseUndoBuffers and these variables will again set back to their
+ * default values.
+ */
+static PreparedUndoSpace *prepared_undo = def_prepared;
+static UndoBuffers *undo_buffer = def_buffers;
+
+/*
+ * Structure to hold the previous transaction's undo update information.  This
+ * is populated while the current transaction is updating its undo record
+ * pointer in the previous transaction's first undo record.
+ */
+typedef struct XactUndoRecordInfo
+{
+	UndoRecPtr	urecptr;		/* txn's start urecptr */
+	int			idx_undo_buffers[MAX_BUFFER_PER_UNDO];
+	UnpackedUndoRecord uur;		/* undo record header */
+} XactUndoRecordInfo;
+
+static XactUndoRecordInfo xact_urec_info[MAX_XACT_UNDO_INFO];
+static int	xact_urec_info_idx;
+
+/* Prototypes for static functions. */
+static UnpackedUndoRecord *UndoGetOneRecord(UnpackedUndoRecord *urec,
+				 UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence);
+static void UndoRecordPrepareTransInfo(UndoRecPtr urecptr,
+						   UndoRecPtr xact_urp);
+static void UndoRecordUpdateTransInfo(int idx);
+static int UndoGetBufferSlot(RelFileNode rnode, BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence);
+static bool UndoRecordIsValid(UndoLogControl * log,
+				  UndoRecPtr urp);
+
+/*
+ * Check whether the undo record has been discarded.  If it has already been
+ * discarded, return false; otherwise return true.
+ *
+ * The caller must hold log->discard_lock.  This function releases the lock
+ * if it returns false; otherwise the lock is still held on return and the
+ * caller needs to release it.
+ */
+static bool
+UndoRecordIsValid(UndoLogControl * log, UndoRecPtr urp)
+{
+	Assert(LWLockHeldByMeInMode(&log->discard_lock, LW_SHARED));
+
+	if (log->oldest_data == InvalidUndoRecPtr)
+	{
+		/*
+		 * oldest_data is only initialized when the DiscardWorker first time
+		 * attempts to discard undo logs so we can not rely on this value to
+		 * identify whether the undo record pointer is already discarded or
+		 * not so we can check it by calling undo log routine.  If its not yet
+		 * discarded then we have to reacquire the log->discard_lock so that
+		 * the doesn't get discarded concurrently.
+		 */
+		LWLockRelease(&log->discard_lock);
+		if (UndoLogIsDiscarded(urp))
+			return false;
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+	}
+
+	/* Check again if it's already discarded. */
+	if (urp < log->oldest_data)
+	{
+		LWLockRelease(&log->discard_lock);
+		return false;
+	}
+
+	return true;
+}
+
+/*
+ * Prepare to update the previous transaction's next undo pointer to maintain
+ * the transaction chain in the undo.  This will read the header of the first
+ * undo record of the previous transaction and lock the necessary buffers.
+ * The actual update will be done by UndoRecordUpdateTransInfo under the
+ * critical section.
+ */
+static void
+UndoRecordPrepareTransInfo(UndoRecPtr urecptr, UndoRecPtr xact_urp)
+{
+	Buffer		buffer = InvalidBuffer;
+	BlockNumber cur_blk;
+	RelFileNode rnode;
+	UndoLogControl *log;
+	Page		page;
+	int			already_decoded = 0;
+	int			starting_byte;
+	int			bufidx;
+	int			index = 0;
+
+	/*
+	 * The absence of the previous transaction's undo indicates that this
+	 * backend is preparing its first undo, in which case we have nothing to
+	 * update.
+	 */
+	if (!UndoRecPtrIsValid(xact_urp))
+		return;
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(xact_urp), false);
+
+	/*
+	 * Temporary undo logs are discarded on transaction commit so we don't
+	 * need to do anything.
+	 */
+	if (log->meta.persistence == UNDO_TEMP)
+		return;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that the
+	 * discard worker doesn't remove the record while we are in the process
+	 * of reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	/*
+	 * If the previous transaction's undo has already been discarded, we have
+	 * nothing to update.
+	 * UndoRecordIsValid will release the lock if it returns false.
+	 */
+	if (!UndoRecordIsValid(log, xact_urp))
+		return;
+
+	UndoRecPtrAssignRelFileNode(rnode, xact_urp);
+	cur_blk = UndoRecPtrGetBlockNum(xact_urp);
+	starting_byte = UndoRecPtrGetPageOffset(xact_urp);
+
+	/*
+	 * Read the undo record header by calling UnpackUndoRecord; if the header
+	 * is split across buffers, we need to read the complete header by
+	 * invoking UnpackUndoRecord multiple times.
+	 */
+	while (true)
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk,
+								   RBM_NORMAL,
+								   log->meta.persistence);
+		xact_urec_info[xact_urec_info_idx].idx_undo_buffers[index++] = bufidx;
+		buffer = undo_buffer[bufidx].buf;
+		page = BufferGetPage(buffer);
+
+		if (UnpackUndoRecord(&xact_urec_info[xact_urec_info_idx].uur, page,
+							 starting_byte, &already_decoded, true))
+			break;
+
+		/* Could not fetch the complete header so go to the next block. */
+		starting_byte = UndoLogBlockHeaderSize;
+		cur_blk++;
+	}
+
+	xact_urec_info[xact_urec_info_idx].uur.uur_next = urecptr;
+	xact_urec_info[xact_urec_info_idx].urecptr = xact_urp;
+	xact_urec_info_idx++;
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Overwrite the first undo record of the previous transaction to update its
+ * next pointer.  This just writes out the record already prepared by
+ * UndoRecordPrepareTransInfo.  It must be called inside a critical section,
+ * and it overwrites only the undo record header, not the data.
+ */
+static void
+UndoRecordUpdateTransInfo(int idx)
+{
+	UndoLogNumber logno = UndoRecPtrGetLogNo(xact_urec_info[idx].urecptr);
+	Page		page;
+	int			starting_byte;
+	int			already_written = 0;
+	int			i = 0;
+	UndoRecPtr	urec_ptr = InvalidUndoRecPtr;
+	UndoLogControl *log;
+
+	log = UndoLogGet(logno, false);
+	urec_ptr = xact_urec_info[idx].urecptr;
+
+	/*
+	 * Acquire the discard lock before accessing the undo record so that the
+	 * discard worker can't remove the record while we are in the process of
+	 * reading it.
+	 */
+	LWLockAcquire(&log->discard_lock, LW_SHARED);
+
+	if (!UndoRecordIsValid(log, urec_ptr))
+		return;
+
+	/*
+	 * Update the next transaction's start urecptr in the transaction header.
+	 */
+	starting_byte = UndoRecPtrGetPageOffset(urec_ptr);
+
+	do
+	{
+		Buffer		buffer;
+		int			buf_idx;
+
+		buf_idx = xact_urec_info[idx].idx_undo_buffers[i];
+		buffer = undo_buffer[buf_idx].buf;
+		page = BufferGetPage(buffer);
+
+		/* Overwrite the previously written undo. */
+		if (InsertUndoRecord(&xact_urec_info[idx].uur, page, starting_byte, &already_written, true))
+		{
+			MarkBufferDirty(buffer);
+			break;
+		}
+
+		MarkBufferDirty(buffer);
+		starting_byte = UndoLogBlockHeaderSize;
+		i++;
+
+		/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+		Assert(i < MAX_BUFFER_PER_UNDO);
+	} while (true);
+
+	LWLockRelease(&log->discard_lock);
+}
+
+/*
+ * Find the block number in the undo buffer array; if it's present, just
+ * return its index.  Otherwise, read the buffer, insert an entry into the
+ * array and lock the buffer in exclusive mode.
+ *
+ * Undo log insertions are append-only.  If the caller is writing new data
+ * that begins exactly at the beginning of a page, then there cannot be any
+ * useful data after that point.  In that case RBM_ZERO can be passed in as
+ * rbm so that we can skip a useless read of a disk block.  In all other
+ * cases, RBM_NORMAL should be passed in, to read the page in if it doesn't
+ * happen to be already in the buffer pool.
+ */
+static int
+UndoGetBufferSlot(RelFileNode rnode,
+				  BlockNumber blk,
+				  ReadBufferMode rbm,
+				  UndoPersistence persistence)
+{
+	int			i;
+	Buffer		buffer;
+
+	/* Don't do anything, if we already have a buffer pinned for the block. */
+	for (i = 0; i < buffer_idx; i++)
+	{
+		/*
+		 * It's not enough to just compare the block number because
+		 * undo_buffer might hold undo from different undo logs (e.g. when the
+		 * previous transaction's start header is in the previous undo log),
+		 * so compare both logno and blkno.
+		 */
+		if ((blk == undo_buffer[i].blk) &&
+			(undo_buffer[i].logno == rnode.relNode))
+		{
+			/* caller must hold exclusive lock on buffer */
+			Assert(BufferIsLocal(undo_buffer[i].buf) ||
+				   LWLockHeldByMeInMode(BufferDescriptorGetContentLock(
+																	   GetBufferDescriptor(undo_buffer[i].buf - 1)),
+										LW_EXCLUSIVE));
+			break;
+		}
+	}
+
+	/*
+	 * We did not find the block, so read the buffer and insert it into the
+	 * undo buffer array.
+	 */
+	if (i == buffer_idx)
+	{
+		/*
+		 * Fetch the buffer in which we want to insert the undo record.
+		 */
+		buffer = ReadBufferWithoutRelcache(rnode,
+										   UndoLogForkNum,
+										   blk,
+										   rbm,
+										   NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		/* Lock the buffer */
+		LockBuffer(buffer, BUFFER_LOCK_EXCLUSIVE);
+
+		undo_buffer[buffer_idx].buf = buffer;
+		undo_buffer[buffer_idx].blk = blk;
+		undo_buffer[buffer_idx].logno = rnode.relNode;
+		undo_buffer[buffer_idx].zero = rbm == RBM_ZERO;
+		buffer_idx++;
+	}
+
+	return i;
+}
+
+/*
+ * Calculate the total size required by nrecords undo records and allocate
+ * space for them in bulk.  This is required for operations which can write
+ * multiple undo records in one WAL operation, e.g. multi-insert.  If we
+ * don't allocate undo space for all the records (which are inserted under
+ * one WAL record) together, some of them could end up in a different undo
+ * log.  And, currently during recovery we don't have a mechanism to map an
+ * xid to multiple log numbers for one WAL operation.  So, in short, all the
+ * operations under one WAL record must allocate their undo from the same
+ * undo log.
+ */
+static UndoRecPtr
+UndoRecordAllocate(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId txid, UndoPersistence upersistence)
+{
+	UnpackedUndoRecord *urec = NULL;
+	UndoLogControl *log;
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	UndoRecPtr	prevlogurp = InvalidUndoRecPtr;
+	UndoLogNumber prevlogno = InvalidUndoLogNumber;
+	bool		need_xact_hdr = false;
+	bool		log_switched = false;
+	int			i;
+
+	/* There must be at least one undo record. */
+	if (nrecords <= 0)
+		elog(ERROR, "cannot allocate space for zero undo records");
+
+	/* Is this the first undo record of the transaction? */
+	if ((InRecovery && IsTransactionFirstRec(txid)) ||
+		(!InRecovery && prev_txid[upersistence] != txid))
+		need_xact_hdr = true;
+
+resize:
+	size = 0;
+
+	for (i = 0; i < nrecords; i++)
+	{
+		urec = undorecords + i;
+
+		/*
+		 * Prepare the transaction header for the first undo record of the
+		 * transaction.
+		 *
+		 * XXX Alternatively, instead of adding this information to the first
+		 * record, we could prepare a separate record that contains only
+		 * transaction information, but we don't see any clear advantage in
+		 * doing so.
+		 */
+		if (need_xact_hdr && i == 0)
+		{
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_progress = 0;
+
+			if (log_switched)
+			{
+				/*
+				 * If the undo log is switched, then during rollback we cannot
+				 * reach the previous undo record of the transaction via
+				 * prevlen, so we store the previous undo record pointer in
+				 * the transaction header.
+				 */
+				Assert(UndoRecPtrIsValid(prevlogno));
+				log = UndoLogGet(prevlogno, false);
+				urec->uur_prevurp = MakeUndoRecPtr(prevlogno,
+												   log->meta.insert - log->meta.prevlen);
+			}
+			else
+				urec->uur_prevurp = InvalidUndoRecPtr;
+
+			/* During recovery, get the database id from the undo log state. */
+			if (InRecovery)
+				urec->uur_dbid = UndoLogStateGetDatabaseId();
+			else
+				urec->uur_dbid = MyDatabaseId;
+
+			/* Set uur_info to include the transaction header. */
+			urec->uur_info |= UREC_INFO_TRANSACTION;
+		}
+		else
+		{
+			/*
+			 * It is okay to initialize these variables with invalid values,
+			 * as they are used only with the first record of a transaction.
+			 */
+			urec->uur_next = InvalidUndoRecPtr;
+			urec->uur_dbid = 0;
+			urec->uur_progress = 0;
+			urec->uur_prevurp = InvalidUndoRecPtr;
+		}
+
+		/* Calculate the size of the undo record based on the info required. */
+		UndoRecordSetInfo(urec);
+		size += UndoRecordExpectedSize(urec);
+	}
+
+	/*
+	 * Check whether the undo log got switched while we are in a transaction.
+	 */
+	if (InRecovery)
+	{
+		/*
+		 * During recovery we can identify a log switch by checking prevlogurp
+		 * in MyUndoLogState.  The WAL replay action for the log switch would
+		 * have set the value, and we need to clear it after retrieving the
+		 * latest value.
+		 */
+		prevlogurp = UndoLogStateGetAndClearPrevLogXactUrp();
+		urecptr = UndoLogAllocateInRecovery(txid, size, upersistence);
+		if (UndoRecPtrIsValid(prevlogurp))
+		{
+			prevlogno = UndoRecPtrGetLogNo(prevlogurp);
+			log_switched = true;
+		}
+	}
+	else
+	{
+		/*
+		 * Check whether the current log switched during allocation.  We can
+		 * determine that simply by checking which log we are attached to
+		 * before and after allocation.
+		 */
+		prevlogno = UndoLogAmAttachedTo(upersistence);
+		urecptr = UndoLogAllocate(size, upersistence);
+		if (!need_xact_hdr &&
+			prevlogno != InvalidUndoLogNumber &&
+			prevlogno != UndoRecPtrGetLogNo(urecptr))
+		{
+			log = UndoLogGet(prevlogno, false);
+			prevlogurp = MakeUndoRecPtr(prevlogno, log->meta.last_xact_start);
+			log_switched = true;
+		}
+	}
+
+	log = UndoLogGet(UndoRecPtrGetLogNo(urecptr), false);
+
+	/*
+	 * By now, we must be attached to some undo log unless we are in recovery.
+	 */
+	Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+	/*
+	 * If we've rewound all the way back to the start of the transaction by
+	 * rolling back the first subtransaction (which we can't detect until
+	 * after we've allocated some space) or the undo log got switched, we'll
+	 * need a new transaction header. If we weren't already generating one,
+	 * then do it now.
+	 */
+	if (!need_xact_hdr &&
+		(log->meta.insert == log->meta.last_xact_start || log_switched))
+	{
+		need_xact_hdr = true;
+		urec->uur_info = 0;		/* force recomputation of info bits */
+		goto resize;
+	}
+
+	/* Update the previous transaction's start undo record, if required. */
+	if (need_xact_hdr || log_switched)
+	{
+		/*
+		 * If the undo log has switched, we need to update our own
+		 * transaction header in the previous log as well as the previous
+		 * transaction's header in the new log.  See the detailed comments on
+		 * multi-log handling at the top of this file.
+		 */
+		if (log_switched)
+			UndoRecordPrepareTransInfo(urecptr, prevlogurp);
+
+		/* Don't update our own start header. */
+		if (log->meta.last_xact_start != log->meta.insert)
+			UndoRecordPrepareTransInfo(urecptr,
+									   MakeUndoRecPtr(log->logno, log->meta.last_xact_start));
+
+		/* Remember the current transaction's xid. */
+		prev_txid[upersistence] = txid;
+
+		/* Store the current transaction's start undorecptr in the undo log. */
+		UndoLogSetLastXactStartPoint(urecptr);
+	}
+
+	UndoLogAdvance(urecptr, size, upersistence);
+
+	/*
+	 * Write WAL for log switch.  This is required to identify the log switch
+	 * during recovery.
+	 */
+	if (!InRecovery && log_switched && upersistence == UNDO_PERMANENT)
+	{
+		XLogBeginInsert();
+		XLogRegisterData((char *) &prevlogurp, sizeof(UndoRecPtr));
+		XLogInsert(RM_UNDOLOG_ID, XLOG_UNDOLOG_SWITCH);
+	}
+
+	return urecptr;
+}
+
+/*
+ * Call UndoSetPrepareSize to set how many undo records can be prepared
+ * before we insert them.  If the count is greater than MAX_PREPARED_UNDO,
+ * extra memory is allocated to hold the extra prepared undo records.
+ *
+ * This is normally used when more than one undo record needs to be prepared.
+ */
+void
+UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence)
+{
+	TransactionId txid;
+
+	/* Get the top transaction id. */
+	if (xid == InvalidTransactionId)
+	{
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	prepared_urec_ptr = UndoRecordAllocate(undorecords, nrecords, txid,
+										   upersistence);
+	if (nrecords <= MAX_PREPARED_UNDO)
+		return;
+
+	prepared_undo = palloc0(nrecords * sizeof(PreparedUndoSpace));
+
+	/*
+	 * Also consider the buffers needed for updating the previous
+	 * transaction's starting undo record; hence the count is increased by 1.
+	 */
+	undo_buffer = palloc0((nrecords + 1) * MAX_BUFFER_PER_UNDO *
+						  sizeof(UndoBuffers));
+	max_prepared_undo = nrecords;
+}
+
+/*
+ * Call PrepareUndoInsert to tell the undo subsystem about the undo record you
+ * intend to insert.  Upon return, the necessary undo buffers are pinned and
+ * locked.
+ *
+ * This should be done before any critical section is established, since it
+ * can fail.
+ *
+ * In recovery, 'xid' refers to the transaction id stored in WAL; otherwise,
+ * it refers to the top transaction id, because the undo log only stores the
+ * mapping for top-level transactions.
+ */
+UndoRecPtr
+PrepareUndoInsert(UnpackedUndoRecord *urec, TransactionId xid,
+				  UndoPersistence upersistence)
+{
+	UndoRecordSize size;
+	UndoRecPtr	urecptr;
+	RelFileNode rnode;
+	UndoRecordSize cur_size = 0;
+	BlockNumber cur_blk;
+	TransactionId txid;
+	int			starting_byte;
+	int			index = 0;
+	int			bufidx;
+	ReadBufferMode rbm;
+
+	/* Already reached maximum prepared limit. */
+	if (prepare_idx == max_prepared_undo)
+		elog(ERROR, "already reached the maximum prepared limit");
+
+
+	if (xid == InvalidTransactionId)
+	{
+		/* During recovery, we must have a valid transaction id. */
+		Assert(!InRecovery);
+		txid = GetTopTransactionId();
+	}
+	else
+	{
+		/*
+		 * Assign the top transaction id, because the undo log only stores
+		 * the mapping for top-level transactions.
+		 */
+		Assert(InRecovery || (xid == GetTopTransactionId()));
+		txid = xid;
+	}
+
+	if (!UndoRecPtrIsValid(prepared_urec_ptr))
+		urecptr = UndoRecordAllocate(urec, 1, txid, upersistence);
+	else
+		urecptr = prepared_urec_ptr;
+
+	/* advance the prepared ptr location for next record. */
+	size = UndoRecordExpectedSize(urec);
+	if (UndoRecPtrIsValid(prepared_urec_ptr))
+	{
+		UndoLogOffset insert = UndoRecPtrGetOffset(prepared_urec_ptr);
+
+		insert = UndoLogOffsetPlusUsableBytes(insert, size);
+		prepared_urec_ptr = MakeUndoRecPtr(UndoRecPtrGetLogNo(urecptr), insert);
+	}
+
+	cur_blk = UndoRecPtrGetBlockNum(urecptr);
+	UndoRecPtrAssignRelFileNode(rnode, urecptr);
+	starting_byte = UndoRecPtrGetPageOffset(urecptr);
+
+	/*
+	 * If we happen to be writing the very first byte into this page, then
+	 * there is no need to read from disk.
+	 */
+	if (starting_byte == UndoLogBlockHeaderSize)
+		rbm = RBM_ZERO;
+	else
+		rbm = RBM_NORMAL;
+
+	do
+	{
+		bufidx = UndoGetBufferSlot(rnode, cur_blk, rbm, upersistence);
+		if (cur_size == 0)
+			cur_size = BLCKSZ - starting_byte;
+		else
+			cur_size += BLCKSZ - UndoLogBlockHeaderSize;
+
+		/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+		Assert(index < MAX_BUFFER_PER_UNDO);
+
+		/* Keep track of the buffers we have pinned and locked. */
+		prepared_undo[prepare_idx].undo_buffer_idx[index++] = bufidx;
+
+		/*
+		 * If we need more pages, they'll all be new, so we can definitely skip
+		 * reading from disk.
+		 */
+		rbm = RBM_ZERO;
+		cur_blk++;
+	} while (cur_size < size);
+
+	/*
+	 * Save the undo record information to be later used by InsertPreparedUndo
+	 * to insert the prepared record.
+	 */
+	prepared_undo[prepare_idx].urec = urec;
+	prepared_undo[prepare_idx].urp = urecptr;
+	prepare_idx++;
+
+	return urecptr;
+}
+
+/*
+ * Insert a previously-prepared undo record.  This will write the actual undo
+ * record into the buffers already pinned and locked in PrepareUndoInsert,
+ * and mark them dirty.  This step should be performed after entering a
+ * critical section; it should never fail.
+ */
+void
+InsertPreparedUndo(void)
+{
+	Page		page;
+	int			starting_byte;
+	int			already_written;
+	int			bufidx = 0;
+	int			idx;
+	uint16		undo_len = 0;
+	UndoRecPtr	urp;
+	UnpackedUndoRecord *uur;
+	UndoLogOffset offset;
+	UndoLogControl *log;
+
+	/* There must be at least one prepared undo record. */
+	Assert(prepare_idx > 0);
+
+	/*
+	 * This must be called under a critical section or we must be in recovery.
+	 */
+	Assert(InRecovery || CritSectionCount > 0);
+
+	for (idx = 0; idx < prepare_idx; idx++)
+	{
+		uur = prepared_undo[idx].urec;
+		urp = prepared_undo[idx].urp;
+
+		already_written = 0;
+		bufidx = 0;
+		starting_byte = UndoRecPtrGetPageOffset(urp);
+		offset = UndoRecPtrGetOffset(urp);
+
+		log = UndoLogGet(UndoRecPtrGetLogNo(urp), false);
+		Assert(AmAttachedToUndoLog(log) || InRecovery);
+
+		/*
+		 * Store the previous undo record length in the header.  We can read
+		 * meta.prevlen without locking, because only we can write to it.
+		 */
+		uur->uur_prevlen = log->meta.prevlen;
+
+		/*
+		 * If starting a new log then there is no prevlen to store.
+		 */
+		if (offset == UndoLogBlockHeaderSize)
+			uur->uur_prevlen = 0;
+
+		/*
+		 * If starting from a new page, account for the block header size in
+		 * the prevlen calculation.
+		 */
+		else if (starting_byte == UndoLogBlockHeaderSize)
+			uur->uur_prevlen += UndoLogBlockHeaderSize;
+
+		undo_len = 0;
+
+		do
+		{
+			PreparedUndoSpace undospace = prepared_undo[idx];
+			Buffer		buffer;
+
+			buffer = undo_buffer[undospace.undo_buffer_idx[bufidx]].buf;
+			page = BufferGetPage(buffer);
+
+			/*
+			 * Initialize the page whenever we write the first record on a
+			 * page.  We start writing immediately after the block header.
+			 */
+			if (starting_byte == UndoLogBlockHeaderSize)
+				PageInit(page, BLCKSZ, 0);
+
+			/*
+			 * Try to insert the record into the current page.  If it doesn't
+			 * succeed, call the routine again with the next page.
+			 */
+			if (InsertUndoRecord(uur, page, starting_byte, &already_written, false))
+			{
+				undo_len += already_written;
+				MarkBufferDirty(buffer);
+				break;
+			}
+
+			MarkBufferDirty(buffer);
+
+			/*
+			 * If we are switching to the next block, include the block header
+			 * in the total undo length.
+			 */
+			starting_byte = UndoLogBlockHeaderSize;
+			undo_len += UndoLogBlockHeaderSize;
+			bufidx++;
+
+			/* An undo record can't use more than MAX_BUFFER_PER_UNDO buffers. */
+			Assert(bufidx < MAX_BUFFER_PER_UNDO);
+		} while (true);
+
+		UndoLogSetPrevLen(UndoRecPtrGetLogNo(urp), undo_len);
+
+		/*
+		 * Set the current undo location for a transaction.  This is required
+		 * to perform rollback when the transaction aborts.
+		 */
+		SetCurrentUndoLocation(urp);
+	}
+
+	/* Update previously prepared transaction headers. */
+	if (xact_urec_info_idx > 0)
+	{
+		int			i = 0;
+
+		for (i = 0; i < xact_urec_info_idx; i++)
+			UndoRecordUpdateTransInfo(i);
+	}
+
+}
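+
+/*
+ * Illustrative sketch only (the exact WAL wiring is up to the caller, and
+ * the names below are just an example): a caller writing a single undo
+ * record alongside its WAL record would do roughly the following.
+ *
+ *     urecptr = PrepareUndoInsert(&urec, InvalidTransactionId, UNDO_PERMANENT);
+ *     ... XLogBeginInsert() and register the caller's WAL data ...
+ *     START_CRIT_SECTION();
+ *     InsertPreparedUndo();
+ *     RegisterUndoLogBuffers(first_block_id);
+ *     lsn = XLogInsert(rmid, info);
+ *     UndoLogBuffersSetLSN(lsn);
+ *     END_CRIT_SECTION();
+ *     UnlockReleaseUndoBuffers();
+ */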
+
+/*
+ * Helper function for UndoFetchRecord.  It fetches the undo record pointed
+ * to by urp and unpacks it into urec.  This function does not release the
+ * pin on the buffer if the complete record is fetched from one buffer, so
+ * the caller can reuse the same urec to fetch another undo record from the
+ * same block.  The caller is responsible for releasing the buffer inside
+ * urec and setting it to invalid if it wishes to fetch a record from another
+ * block.
+ */
+static UnpackedUndoRecord *
+UndoGetOneRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode rnode,
+				 UndoPersistence persistence)
+{
+	Buffer		buffer = urec->uur_buffer;
+	Page		page;
+	int			starting_byte = UndoRecPtrGetPageOffset(urp);
+	int			already_decoded = 0;
+	BlockNumber cur_blk;
+	bool		is_undo_rec_split = false;
+
+	cur_blk = UndoRecPtrGetBlockNum(urp);
+
+	/* If we already have a buffer pin then no need to allocate a new one. */
+	if (!BufferIsValid(buffer))
+	{
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+
+		urec->uur_buffer = buffer;
+	}
+
+	while (true)
+	{
+		LockBuffer(buffer, BUFFER_LOCK_SHARE);
+		page = BufferGetPage(buffer);
+
+		/*
+		 * XXX This could be optimized to fetch just the header first, and
+		 * only fetch the complete record if the header matches the block
+		 * number and offset.
+		 */
+		if (UnpackUndoRecord(urec, page, starting_byte, &already_decoded, false))
+			break;
+
+		starting_byte = UndoLogBlockHeaderSize;
+		is_undo_rec_split = true;
+
+		/*
+		 * The record spans more than a page so we would have copied it (see
+		 * UnpackUndoRecord).  In such cases, we can release the buffer.
+		 */
+		urec->uur_buffer = InvalidBuffer;
+		UnlockReleaseBuffer(buffer);
+
+		/* Go to next block. */
+		cur_blk++;
+		buffer = ReadBufferWithoutRelcache(rnode, UndoLogForkNum, cur_blk,
+										   RBM_NORMAL, NULL,
+										   RelPersistenceForUndoPersistence(persistence));
+	}
+
+	/*
+	 * If we have copied the data, release the buffer; otherwise, just
+	 * unlock it.
+	 */
+	if (is_undo_rec_split)
+		UnlockReleaseBuffer(buffer);
+	else
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
+	return urec;
+}
+
+/*
+ * ResetUndoRecord - Helper function for UndoFetchRecord to reset the current
+ * record.
+ */
+static void
+ResetUndoRecord(UnpackedUndoRecord *urec, UndoRecPtr urp, RelFileNode *rnode,
+				RelFileNode *prevrec_rnode)
+{
+	/*
+	 * If we have a valid buffer pinned, keep it only if the next record we
+	 * want to fetch is in the same block.  Otherwise, release the buffer and
+	 * set it to invalid.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		/*
+		 * Undo buffer will be changed if the next undo record belongs to a
+		 * different block or undo log.
+		 */
+		if ((UndoRecPtrGetBlockNum(urp) !=
+			 BufferGetBlockNumber(urec->uur_buffer)) ||
+			(prevrec_rnode->relNode != rnode->relNode))
+		{
+			ReleaseBuffer(urec->uur_buffer);
+			urec->uur_buffer = InvalidBuffer;
+		}
+	}
+	else
+	{
+		/*
+		 * If there is no valid buffer in urec->uur_buffer, that means we
+		 * copied the payload data and tuple data, so free them.
+		 */
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	/* Reset the urec before fetching the tuple */
+	urec->uur_tuple.data = NULL;
+	urec->uur_tuple.len = 0;
+	urec->uur_payload.data = NULL;
+	urec->uur_payload.len = 0;
+}
+
+/*
+ * Fetch the next undo record for the given blkno, offset and transaction id
+ * (if valid).  The same tuple can be modified by multiple transactions, so
+ * during undo chain traversal we sometimes need to distinguish based on
+ * transaction id.  Callers that don't have any such requirement can pass
+ * InvalidTransactionId.
+ *
+ * Start the search from urp.  The caller needs to call UndoRecordRelease to
+ * release the resources allocated by this function.
+ *
+ * urec_ptr_out is set to the undo record pointer of the qualifying undo
+ * record if a valid pointer is passed.
+ *
+ * The callback function decides whether a particular undo record satisfies
+ * the caller's condition.
+ *
+ * Returns the required undo record if found; otherwise, returns NULL, which
+ * means either the record has already been discarded or there is no such
+ * record in the undo chain.
+ */
+UnpackedUndoRecord *
+UndoFetchRecord(UndoRecPtr urp, BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback)
+{
+	RelFileNode rnode,
+				prevrec_rnode = {0};
+	UnpackedUndoRecord *urec = NULL;
+	int			logno;
+
+	if (urec_ptr_out)
+		*urec_ptr_out = InvalidUndoRecPtr;
+
+	urec = palloc0(sizeof(UnpackedUndoRecord));
+	UndoRecPtrAssignRelFileNode(rnode, urp);
+
+	/* Find the undo record pointer we are interested in. */
+	while (true)
+	{
+		UndoLogControl *log;
+
+		logno = UndoRecPtrGetLogNo(urp);
+		log = UndoLogGet(logno, false);
+		if (log == NULL)
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/*
+		 * Prevent UndoDiscardOneLog() from discarding data while we try to
+		 * read it.  Usually we would acquire log->mutex to read log->meta
+		 * members, but in this case we know that discard can't move without
+		 * also holding log->discard_lock.
+		 */
+		LWLockAcquire(&log->discard_lock, LW_SHARED);
+		if (!UndoRecordIsValid(log, urp))
+		{
+			if (BufferIsValid(urec->uur_buffer))
+				ReleaseBuffer(urec->uur_buffer);
+			return NULL;
+		}
+
+		/* Fetch the current undo record. */
+		urec = UndoGetOneRecord(urec, urp, rnode, log->meta.persistence);
+		LWLockRelease(&log->discard_lock);
+
+		if (blkno == InvalidBlockNumber)
+			break;
+
+		/* Check whether the undorecord satisfies conditions */
+		if (callback(urec, blkno, offset, xid))
+			break;
+
+		urp = urec->uur_blkprev;
+		prevrec_rnode = rnode;
+
+		/* Get rnode for the current undo record pointer. */
+		UndoRecPtrAssignRelFileNode(rnode, urp);
+
+		/* Reset the current undorecord before fetching the next. */
+		ResetUndoRecord(urec, urp, &rnode, &prevrec_rnode);
+	}
+
+	if (urec_ptr_out)
+		*urec_ptr_out = urp;
+	return urec;
+}
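+
+/*
+ * Illustrative sketch only (hypothetical caller code): a caller chasing the
+ * undo chain for a particular block might do something like the following,
+ * where my_callback is a SatisfyUndoRecordCallback supplied by the caller.
+ *
+ *     uur = UndoFetchRecord(urp, blkno, offnum, InvalidTransactionId,
+ *                           &found_urp, my_callback);
+ *     if (uur != NULL)
+ *     {
+ *         ... inspect uur->uur_xid, uur->uur_tuple, etc. ...
+ *         UndoRecordRelease(uur);
+ *     }
+ */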
+
+/*
+ * Return the previous undo record pointer.
+ *
+ * A valid value of prevurp indicates that the previous undo record
+ * pointer is in some other log and the caller can use it directly.
+ * Otherwise this calculates the previous undo record pointer using the
+ * current urp and the prevlen.
+ */
+UndoRecPtr
+UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp)
+{
+	if (UndoRecPtrIsValid(prevurp))
+		return prevurp;
+	else
+	{
+		UndoLogNumber logno = UndoRecPtrGetLogNo(urp);
+		UndoLogOffset offset = UndoRecPtrGetOffset(urp);
+
+		/* calculate the previous undo record pointer */
+		return MakeUndoRecPtr(logno, offset - prevlen);
+	}
+}
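+
+/*
+ * For illustration only: if urp points at offset 8192 of log 3, prevurp is
+ * invalid and the previous record's length (prevlen) is 60, then the
+ * previous record starts at MakeUndoRecPtr(3, 8132).
+ */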
+
+/*
+ * Release the resources allocated by UndoFetchRecord.
+ */
+void
+UndoRecordRelease(UnpackedUndoRecord *urec)
+{
+	/*
+	 * If the undo record has a valid buffer, just release the buffer;
+	 * otherwise free the tuple and payload data.
+	 */
+	if (BufferIsValid(urec->uur_buffer))
+	{
+		ReleaseBuffer(urec->uur_buffer);
+	}
+	else
+	{
+		if (urec->uur_payload.data)
+			pfree(urec->uur_payload.data);
+		if (urec->uur_tuple.data)
+			pfree(urec->uur_tuple.data);
+	}
+
+	pfree(urec);
+}
+
+/*
+ * RegisterUndoLogBuffers - Register the undo buffers.
+ */
+void
+RegisterUndoLogBuffers(uint8 first_block_id)
+{
+	int			idx;
+	int			flags;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+	{
+		flags = undo_buffer[idx].zero ? REGBUF_WILL_INIT : 0;
+		XLogRegisterBuffer(first_block_id + idx, undo_buffer[idx].buf, flags);
+	}
+}
+
+/*
+ * UndoLogBuffersSetLSN - Set the LSN on the undo pages.
+ */
+void
+UndoLogBuffersSetLSN(XLogRecPtr recptr)
+{
+	int			idx;
+
+	for (idx = 0; idx < buffer_idx; idx++)
+		PageSetLSN(BufferGetPage(undo_buffer[idx].buf), recptr);
+}
+
+/*
+ * Reset the global variables related to undo buffers.  This is required at
+ * transaction abort and when releasing the undo buffers.
+ */
+void
+ResetUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+	{
+		undo_buffer[i].blk = InvalidBlockNumber;
+		undo_buffer[i].buf = InvalidBuffer;
+	}
+
+	for (i = 0; i < xact_urec_info_idx; i++)
+		xact_urec_info[i].urecptr = InvalidUndoRecPtr;
+
+	/* Reset the prepared index. */
+	prepare_idx = 0;
+	buffer_idx = 0;
+	xact_urec_info_idx = 0;
+	prepared_urec_ptr = InvalidUndoRecPtr;
+
+	/*
+	 * If the max_prepared_undo limit was changed, free the allocated memory
+	 * and reset all the variables back to their default values.
+	 */
+	if (max_prepared_undo > MAX_PREPARED_UNDO)
+	{
+		pfree(undo_buffer);
+		pfree(prepared_undo);
+		undo_buffer = def_buffers;
+		prepared_undo = def_prepared;
+		max_prepared_undo = MAX_PREPARED_UNDO;
+	}
+}
+
+/*
+ * Unlock and release the undo buffers.  This step must be performed after
+ * exiting any critical section in which we have performed undo actions.
+ */
+void
+UnlockReleaseUndoBuffers(void)
+{
+	int			i;
+
+	for (i = 0; i < buffer_idx; i++)
+		UnlockReleaseBuffer(undo_buffer[i].buf);
+
+	ResetUndoBuffers();
+}
diff --git a/src/backend/access/undo/undorecord.c b/src/backend/access/undo/undorecord.c
new file mode 100644
index 0000000..23ed374
--- /dev/null
+++ b/src/backend/access/undo/undorecord.c
@@ -0,0 +1,460 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.c
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/backend/access/undo/undorecord.c
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/subtrans.h"
+#include "access/undorecord.h"
+#include "catalog/pg_tablespace.h"
+#include "storage/block.h"
+
+/* Workspace for InsertUndoRecord and UnpackUndoRecord. */
+static UndoRecordHeader work_hdr;
+static UndoRecordRelationDetails work_rd;
+static UndoRecordBlock work_blk;
+static UndoRecordTransaction work_txn;
+static UndoRecordPayload work_payload;
+
+/* Prototypes for static functions. */
+static bool InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written);
+static bool ReadUndoBytes(char *destptr, int readlen,
+			  char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy);
+
+/*
+ * Compute and return the expected size of an undo record.
+ */
+Size
+UndoRecordExpectedSize(UnpackedUndoRecord *uur)
+{
+	Size		size;
+
+	size = SizeOfUndoRecordHeader;
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+		size += SizeOfUndoRecordRelationDetails;
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+		size += SizeOfUndoRecordBlock;
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+		size += SizeOfUndoRecordTransaction;
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		size += SizeOfUndoRecordPayload;
+		size += uur->uur_payload.len;
+		size += uur->uur_tuple.len;
+	}
+
+	return size;
+}
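+
+/*
+ * For illustration only: a record whose uur_info has just UREC_INFO_BLOCK
+ * and UREC_INFO_PAYLOAD set, with a 20-byte payload and a 40-byte tuple, is
+ * expected to occupy SizeOfUndoRecordHeader + SizeOfUndoRecordBlock +
+ * SizeOfUndoRecordPayload + 20 + 40 bytes.
+ */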
+
+/*
+ * To insert an undo record, call InsertUndoRecord() repeatedly until it
+ * returns true.
+ *
+ * Insert as much of an undo record as will fit in the given page.
+ * starting_byte is the byte within the given page at which to begin writing,
+ * while *already_written is the number of bytes written to previous pages.
+ *
+ * Returns true if the remainder of the record was written and false if more
+ * bytes remain to be written; in either case, *already_written is set to the
+ * number of bytes written thus far.
+ *
+ * This function assumes that if *already_written is non-zero on entry, the
+ * same UnpackedUndoRecord is passed each time.  It also assumes that
+ * UnpackUndoRecord is not called between successive calls to InsertUndoRecord
+ * for the same UnpackedUndoRecord.
+ *
+ * If this function is called again to continue writing the record, the
+ * previous value for *already_written should be passed again, and
+ * starting_byte should be passed as sizeof(PageHeaderData) (since the record
+ * will continue immediately following the page header).
+ */
+bool
+InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only)
+{
+	char	   *writeptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_written = *already_written;
+
+	/* The undo record must contain valid information. */
+	Assert(uur->uur_info != 0);
+
+	/*
+	 * If this is the first call, copy the UnpackedUndoRecord into the
+	 * temporary variables of the types that will actually be stored in the
+	 * undo pages.  We just initialize everything here, on the assumption that
+	 * it's not worth adding branches to save a handful of assignments.
+	 */
+	if (*already_written == 0)
+	{
+		work_hdr.urec_type = uur->uur_type;
+		work_hdr.urec_info = uur->uur_info;
+		work_hdr.urec_prevlen = uur->uur_prevlen;
+		work_hdr.urec_reloid = uur->uur_reloid;
+		work_hdr.urec_prevxid = uur->uur_prevxid;
+		work_hdr.urec_xid = uur->uur_xid;
+		work_hdr.urec_cid = uur->uur_cid;
+		work_rd.urec_fork = uur->uur_fork;
+		work_blk.urec_blkprev = uur->uur_blkprev;
+		work_blk.urec_block = uur->uur_block;
+		work_blk.urec_offset = uur->uur_offset;
+		work_txn.urec_progress = uur->uur_progress;
+		work_txn.urec_dbid = uur->uur_dbid;
+		work_txn.urec_prevurp = uur->uur_prevurp;
+		work_txn.urec_next = uur->uur_next;
+		work_payload.urec_payload_len = uur->uur_payload.len;
+		work_payload.urec_tuple_len = uur->uur_tuple.len;
+	}
+	else
+	{
+		/*
+		 * We should have been passed the same record descriptor as before, or
+		 * caller has messed up.
+		 */
+		Assert(work_hdr.urec_type == uur->uur_type);
+		Assert(work_hdr.urec_info == uur->uur_info);
+		Assert(work_hdr.urec_prevlen == uur->uur_prevlen);
+		Assert(work_hdr.urec_reloid == uur->uur_reloid);
+		Assert(work_hdr.urec_prevxid == uur->uur_prevxid);
+		Assert(work_hdr.urec_xid == uur->uur_xid);
+		Assert(work_hdr.urec_cid == uur->uur_cid);
+		Assert(work_rd.urec_fork == uur->uur_fork);
+		Assert(work_blk.urec_blkprev == uur->uur_blkprev);
+		Assert(work_blk.urec_block == uur->uur_block);
+		Assert(work_blk.urec_offset == uur->uur_offset);
+		Assert(work_txn.urec_progress == uur->uur_progress);
+		Assert(work_txn.urec_dbid == uur->uur_dbid);
+		Assert(work_txn.urec_prevurp == uur->uur_prevurp);
+		Assert(work_txn.urec_next == uur->uur_next);
+		Assert(work_payload.urec_payload_len == uur->uur_payload.len);
+		Assert(work_payload.urec_tuple_len == uur->uur_tuple.len);
+	}
+
+	/* Write header (if not already done). */
+	if (!InsertUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write relation details (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0 &&
+		!InsertUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write block information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0 &&
+		!InsertUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	/* Write transaction information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0 &&
+		!InsertUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						 &writeptr, endptr,
+						 &my_bytes_written, already_written))
+		return false;
+
+	if (header_only)
+		return true;
+
+	/* Write payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		/* Payload header. */
+		if (!InsertUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Payload bytes. */
+		if (uur->uur_payload.len > 0 &&
+			!InsertUndoBytes(uur->uur_payload.data, uur->uur_payload.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+
+		/* Tuple bytes. */
+		if (uur->uur_tuple.len > 0 &&
+			!InsertUndoBytes(uur->uur_tuple.data, uur->uur_tuple.len,
+							 &writeptr, endptr,
+							 &my_bytes_written, already_written))
+			return false;
+	}
+
+	/* Hooray! */
+	return true;
+}
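+
+/*
+ * Illustrative sketch only (hypothetical caller code, mirroring the loop in
+ * InsertPreparedUndo): the intended calling pattern is to retry on each
+ * successive undo page until the whole record has been written.
+ *
+ *     already_written = 0;
+ *     starting_byte = UndoRecPtrGetPageOffset(urp);
+ *     for (;;)
+ *     {
+ *         if (InsertUndoRecord(uur, page, starting_byte, &already_written,
+ *                              false))
+ *             break;
+ *         page = ... the next undo page, pinned and locked ...;
+ *         starting_byte = UndoLogBlockHeaderSize;
+ *     }
+ */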
+
+/*
+ * Write undo bytes from a particular source, but only to the extent that
+ * they weren't written previously and will fit.
+ *
+ * 'sourceptr' points to the source data, and 'sourcelen' is the length of
+ * that data in bytes.
+ *
+ * 'writeptr' points to the insertion point for these bytes, and is updated
+ * for whatever we write.  The insertion point must not pass 'endptr', which
+ * represents the end of the buffer into which we are writing.
+ *
+ * 'my_bytes_written' is a pointer to the count of previously-written bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.
+ *
+ * 'total_bytes_written' points to the count of all previously-written bytes,
+ * and it must likewise be updated for the bytes we write.
+ *
+ * The return value is false if we ran out of space before writing all
+ * the bytes, and otherwise true.
+ */
+static bool
+InsertUndoBytes(char *sourceptr, int sourcelen,
+				char **writeptr, char *endptr,
+				int *my_bytes_written, int *total_bytes_written)
+{
+	int			can_write;
+	int			remaining;
+
+	/*
+	 * If we've previously written all of these bytes, there's nothing to do
+	 * except update *my_bytes_written, which we must do to ensure that the
+	 * next call to this function gets the right starting value.
+	 */
+	if (*my_bytes_written >= sourcelen)
+	{
+		*my_bytes_written -= sourcelen;
+		return true;
+	}
+
+	/* Compute number of bytes we can write. */
+	remaining = sourcelen - *my_bytes_written;
+	can_write = Min(remaining, endptr - *writeptr);
+
+	/* Bail out if no bytes can be written. */
+	if (can_write == 0)
+		return false;
+
+	/* Copy the bytes we can write. */
+	memcpy(*writeptr, sourceptr + *my_bytes_written, can_write);
+
+	/* Update bookkeeping information. */
+	*writeptr += can_write;
+	*total_bytes_written += can_write;
+	*my_bytes_written = 0;
+
+	/* Return true only if we wrote the whole thing. */
+	return (can_write == remaining);
+}
+
+/*
+ * Call UnpackUndoRecord() one or more times to unpack an undo record.  For
+ * the first call, starting_byte should be set to the beginning of the undo
+ * record within the specified page, and *already_decoded should be set to 0;
+ * the function will update it based on the number of bytes decoded.  The
+ * return value is true if the entire record was unpacked and false if the
+ * record continues on the next page.  In the latter case, the function
+ * should be called again with the next page, passing starting_byte as the
+ * sizeof(PageHeaderData).
+ */
+bool
+UnpackUndoRecord(UnpackedUndoRecord *uur, Page page, int starting_byte,
+				 int *already_decoded, bool header_only)
+{
+	char	   *readptr = (char *) page + starting_byte;
+	char	   *endptr = (char *) page + BLCKSZ;
+	int			my_bytes_decoded = *already_decoded;
+	bool		is_undo_splited = (my_bytes_decoded > 0);
+
+	/* Decode header (if not already done). */
+	if (!ReadUndoBytes((char *) &work_hdr, SizeOfUndoRecordHeader,
+					   &readptr, endptr,
+					   &my_bytes_decoded, already_decoded, false))
+		return false;
+
+	uur->uur_type = work_hdr.urec_type;
+	uur->uur_info = work_hdr.urec_info;
+	uur->uur_prevlen = work_hdr.urec_prevlen;
+	uur->uur_reloid = work_hdr.urec_reloid;
+	uur->uur_prevxid = work_hdr.urec_prevxid;
+	uur->uur_xid = work_hdr.urec_xid;
+	uur->uur_cid = work_hdr.urec_cid;
+
+	if ((uur->uur_info & UREC_INFO_RELATION_DETAILS) != 0)
+	{
+		/* Decode relation details (if not already done). */
+		if (!ReadUndoBytes((char *) &work_rd, SizeOfUndoRecordRelationDetails,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_fork = work_rd.urec_fork;
+	}
+
+	if ((uur->uur_info & UREC_INFO_BLOCK) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_blk, SizeOfUndoRecordBlock,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_blkprev = work_blk.urec_blkprev;
+		uur->uur_block = work_blk.urec_block;
+		uur->uur_offset = work_blk.urec_offset;
+	}
+
+	if ((uur->uur_info & UREC_INFO_TRANSACTION) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_txn, SizeOfUndoRecordTransaction,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_progress = work_txn.urec_progress;
+		uur->uur_dbid = work_txn.urec_dbid;
+		uur->uur_prevurp = work_txn.urec_prevurp;
+		uur->uur_next = work_txn.urec_next;
+	}
+
+	if (header_only)
+		return true;
+
+	/* Read payload information (if needed and not already done). */
+	if ((uur->uur_info & UREC_INFO_PAYLOAD) != 0)
+	{
+		if (!ReadUndoBytes((char *) &work_payload, SizeOfUndoRecordPayload,
+						   &readptr, endptr,
+						   &my_bytes_decoded, already_decoded, false))
+			return false;
+
+		uur->uur_payload.len = work_payload.urec_payload_len;
+		uur->uur_tuple.len = work_payload.urec_tuple_len;
+
+		/*
+		 * If we can read the complete record from a single page, just point
+		 * the payload data and tuple data into the page; otherwise allocate
+		 * memory.
+		 *
+		 * XXX A possible optimization: instead of always allocating memory
+		 * whenever the record is split, we could check whether the payload
+		 * or tuple data falls entirely within one page and avoid allocating
+		 * memory for that part.
+		 */
+		if (!is_undo_splited &&
+			uur->uur_payload.len + uur->uur_tuple.len <= (endptr - readptr))
+		{
+			uur->uur_payload.data = readptr;
+			readptr += uur->uur_payload.len;
+
+			uur->uur_tuple.data = readptr;
+		}
+		else
+		{
+			if (uur->uur_payload.len > 0 && uur->uur_payload.data == NULL)
+				uur->uur_payload.data = (char *) palloc0(uur->uur_payload.len);
+
+			if (uur->uur_tuple.len > 0 && uur->uur_tuple.data == NULL)
+				uur->uur_tuple.data = (char *) palloc0(uur->uur_tuple.len);
+
+			if (!ReadUndoBytes((char *) uur->uur_payload.data,
+							   uur->uur_payload.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+
+			if (!ReadUndoBytes((char *) uur->uur_tuple.data,
+							   uur->uur_tuple.len, &readptr, endptr,
+							   &my_bytes_decoded, already_decoded, false))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/*
+ * Read undo bytes into a particular destination.
+ *
+ * 'destptr' points to the destination buffer, and 'readlen' is the number
+ * of bytes to be read into it.
+ *
+ * 'readptr' points to the read point for these bytes, and is updated
+ * for how much we read.  The read point must not pass 'endptr', which
+ * represents the end of the buffer from which we are reading.
+ *
+ * 'my_bytes_read' is a pointer to the count of previously-read bytes
+ * from this and following structures in this undo record; that is, any
+ * bytes that are part of previous structures in the record have already
+ * been subtracted out.
+ *
+ * 'total_bytes_read' points to the count of all previously-read bytes,
+ * and must likewise be updated for the bytes we read.
+ *
+ * If 'nocopy' is true, readlen bytes are skipped over in the undo page
+ * but are not copied into the destination.
+ *
+ * The return value is false if we ran out of data before reading all
+ * the bytes, and otherwise true.
+ */
+static bool
+ReadUndoBytes(char *destptr, int readlen, char **readptr, char *endptr,
+			  int *my_bytes_read, int *total_bytes_read, bool nocopy)
+{
+	int			can_read;
+	int			remaining;
+
+	if (*my_bytes_read >= readlen)
+	{
+		*my_bytes_read -= readlen;
+		return true;
+	}
+
+	/* Compute number of bytes we can read. */
+	remaining = readlen - *my_bytes_read;
+	can_read = Min(remaining, endptr - *readptr);
+
+	/* Bail out if no bytes can be read. */
+	if (can_read == 0)
+		return false;
+
+	/* Copy the bytes we can read. */
+	if (!nocopy)
+		memcpy(destptr + *my_bytes_read, *readptr, can_read);
+
+	/* Update bookkeeping information. */
+	*readptr += can_read;
+	*total_bytes_read += can_read;
+	*my_bytes_read = 0;
+
+	/* Return true only if we read the whole thing. */
+	return (can_read == remaining);
+}
+
+/*
+ * Set uur_info for an UnpackedUndoRecord appropriately based on which
+ * other fields are set.
+ */
+void
+UndoRecordSetInfo(UnpackedUndoRecord *uur)
+{
+	if (uur->uur_fork != MAIN_FORKNUM)
+		uur->uur_info |= UREC_INFO_RELATION_DETAILS;
+	if (uur->uur_block != InvalidBlockNumber)
+		uur->uur_info |= UREC_INFO_BLOCK;
+	if (uur->uur_next != InvalidUndoRecPtr)
+		uur->uur_info |= UREC_INFO_TRANSACTION;
+	if (uur->uur_payload.len || uur->uur_tuple.len)
+		uur->uur_info |= UREC_INFO_PAYLOAD;
+}
diff --git a/src/include/access/undoinsert.h b/src/include/access/undoinsert.h
new file mode 100644
index 0000000..c333f00
--- /dev/null
+++ b/src/include/access/undoinsert.h
@@ -0,0 +1,50 @@
+/*-------------------------------------------------------------------------
+ *
+ * undoinsert.h
+ *	  entry points for inserting undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undoinsert.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDOINSERT_H
+#define UNDOINSERT_H
+
+#include "access/undolog.h"
+#include "access/undorecord.h"
+#include "access/xlogdefs.h"
+#include "catalog/pg_class.h"
+
+/*
+ * Typedef for callback function for UndoFetchRecord.
+ *
+ * This checks whether an undo record satisfies the given conditions.
+ */
+typedef bool (*SatisfyUndoRecordCallback) (UnpackedUndoRecord *urec,
+										   BlockNumber blkno,
+										   OffsetNumber offset,
+										   TransactionId xid);
+
+extern UndoRecPtr PrepareUndoInsert(UnpackedUndoRecord *, TransactionId xid,
+				  UndoPersistence);
+extern void InsertPreparedUndo(void);
+
+extern void RegisterUndoLogBuffers(uint8 first_block_id);
+extern void UndoLogBuffersSetLSN(XLogRecPtr recptr);
+extern void UnlockReleaseUndoBuffers(void);
+
+extern UnpackedUndoRecord *UndoFetchRecord(UndoRecPtr urp,
+				BlockNumber blkno, OffsetNumber offset,
+				TransactionId xid, UndoRecPtr *urec_ptr_out,
+				SatisfyUndoRecordCallback callback);
+extern void UndoRecordRelease(UnpackedUndoRecord *urec);
+extern void UndoRecordSetPrevUndoLen(uint16 len);
+extern void UndoSetPrepareSize(UnpackedUndoRecord *undorecords, int nrecords,
+				   TransactionId xid, UndoPersistence upersistence);
+extern UndoRecPtr UndoGetPrevUndoRecptr(UndoRecPtr urp, uint16 prevlen, UndoRecPtr prevurp);
+extern void ResetUndoBuffers(void);
+
+#endif							/* UNDOINSERT_H */
diff --git a/src/include/access/undorecord.h b/src/include/access/undorecord.h
new file mode 100644
index 0000000..0dcf1b1
--- /dev/null
+++ b/src/include/access/undorecord.h
@@ -0,0 +1,195 @@
+/*-------------------------------------------------------------------------
+ *
+ * undorecord.h
+ *	  encode and decode undo records
+ *
+ * Portions Copyright (c) 1996-2019, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * src/include/access/undorecord.h
+ *
+ *-------------------------------------------------------------------------
+ */
+#ifndef UNDORECORD_H
+#define UNDORECORD_H
+
+#include "access/undolog.h"
+#include "lib/stringinfo.h"
+#include "storage/block.h"
+#include "storage/bufpage.h"
+#include "storage/buf.h"
+#include "storage/off.h"
+
+
+/*
+ * Every undo record begins with an UndoRecordHeader structure, which is
+ * followed by the additional structures indicated by the contents of
+ * urec_info.  All structures are packed without alignment padding bytes,
+ * and the undo record itself need not be aligned either, so care
+ * must be taken when reading the header.
+ */
+typedef struct UndoRecordHeader
+{
+	uint8		urec_type;		/* record type code */
+	uint8		urec_info;		/* flag bits */
+	uint16		urec_prevlen;	/* length of previous record in bytes */
+	Oid			urec_reloid;	/* relation OID */
+
+	/*
+	 * Transaction id that has modified the tuple present in this undo record.
+	 * If this is older than oldestXidWithEpochHavingUndo, then we can
+	 * consider the tuple in this undo record as visible.
+	 */
+	TransactionId urec_prevxid;
+
+	/*
+	 * Transaction id that has modified the tuple for which this undo record
+	 * is written.  We use this to skip the undo records.  See comments atop
+	 * function UndoFetchRecord.
+	 */
+	TransactionId urec_xid;		/* Transaction id */
+	CommandId	urec_cid;		/* command id */
+} UndoRecordHeader;
+
+#define SizeOfUndoRecordHeader	\
+	(offsetof(UndoRecordHeader, urec_cid) + sizeof(CommandId))
+
+/*
+ * If UREC_INFO_RELATION_DETAILS is set, an UndoRecordRelationDetails structure
+ * follows.
+ *
+ * If UREC_INFO_BLOCK is set, an UndoRecordBlock structure follows.
+ *
+ * If UREC_INFO_TRANSACTION is set, an UndoRecordTransaction structure
+ * follows.
+ *
+ * If UREC_INFO_PAYLOAD is set, an UndoRecordPayload structure follows.
+ *
+ * When (as will often be the case) multiple structures are present, they
+ * appear in a fixed order: UndoRecordRelationDetails first, then
+ * UndoRecordBlock, then UndoRecordTransaction, and finally UndoRecordPayload
+ * (see InsertUndoRecord).
+ */
+#define UREC_INFO_RELATION_DETAILS			0x01
+#define UREC_INFO_BLOCK						0x02
+#define UREC_INFO_PAYLOAD					0x04
+#define UREC_INFO_TRANSACTION				0x08
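+
+/*
+ * For illustration only: per UndoRecordSetInfo, a record for the main fork
+ * that carries block information and a tuple payload, but no transaction
+ * header, has urec_info = UREC_INFO_BLOCK | UREC_INFO_PAYLOAD, and the
+ * UndoRecordBlock and UndoRecordPayload structures follow the
+ * UndoRecordHeader in that order.
+ */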
+
+/*
+ * Additional information about a relation to which this record pertains,
+ * namely the fork number.  If the fork number is MAIN_FORKNUM, this structure
+ * can (and should) be omitted.
+ */
+typedef struct UndoRecordRelationDetails
+{
+	ForkNumber	urec_fork;		/* fork number */
+} UndoRecordRelationDetails;
+
+#define SizeOfUndoRecordRelationDetails \
+	(offsetof(UndoRecordRelationDetails, urec_fork) + sizeof(uint8))
+
+/*
+ * Identifying information for a block to which this record pertains, and
+ * a pointer to the previous record for the same block.
+ */
+typedef struct UndoRecordBlock
+{
+	UndoRecPtr	urec_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber urec_block;		/* block number */
+	OffsetNumber urec_offset;	/* offset number */
+} UndoRecordBlock;
+
+#define SizeOfUndoRecordBlock \
+	(offsetof(UndoRecordBlock, urec_offset) + sizeof(OffsetNumber))
+
+/*
+ * Identifying information for a transaction to which this undo belongs.  This
+ * also stores the dbid and the progress of the undo apply during rollback.
+ */
+typedef struct UndoRecordTransaction
+{
+	/*
+	 * This indicates undo action apply progress: 0 means not started, 1 means
+	 * completed.  In the future, it could also be used to report how much
+	 * undo has been applied so far.
+	 */
+	uint32		urec_progress;
+	Oid			urec_dbid;		/* database id */
+
+	/*
+	 * Pointer to the transaction's previous undo record when the transaction
+	 * spans undo logs.  The first undo record in the new log stores a pointer
+	 * to its predecessor in the previous log, because that location cannot be
+	 * derived from prevlen during rollback.
+	 */
+	UndoRecPtr	urec_prevurp;
+	UndoRecPtr	urec_next;		/* urec pointer of the next transaction */
+} UndoRecordTransaction;
+
+#define SizeOfUrecNext (sizeof(UndoRecPtr))
+#define SizeOfUndoRecordTransaction \
+	(offsetof(UndoRecordTransaction, urec_next) + SizeOfUrecNext)
+
+/*
+ * Information about the amount of payload data and tuple data present
+ * in this record.  The payload bytes immediately follow the structures
+ * specified by flag bits in urec_info, and the tuple bytes follow the
+ * payload bytes.
+ */
+typedef struct UndoRecordPayload
+{
+	uint16		urec_payload_len;	/* # of payload bytes */
+	uint16		urec_tuple_len; /* # of tuple bytes */
+} UndoRecordPayload;
+
+#define SizeOfUndoRecordPayload \
+	(offsetof(UndoRecordPayload, urec_tuple_len) + sizeof(uint16))
+
+/*
+ * Information that can be used to create an undo record or that can be
+ * extracted from one previously created.  The raw undo record format is
+ * difficult to manage, so this structure provides a convenient intermediate
+ * form that is easier for callers to work with.
+ *
+ * When creating an undo record from an UnpackedUndoRecord, the caller should
+ * set uur_info to 0.  It will be initialized by the first call to
+ * UndoRecordSetInfo or InsertUndoRecord.  We do set it in
+ * UndoRecordAllocate for transaction-specific header information.
+ *
+ * When an undo record is decoded into an UnpackedUndoRecord, all fields
+ * will be initialized, but those for which no information is available
+ * will be set to invalid or default values, as appropriate.
+ */
+typedef struct UnpackedUndoRecord
+{
+	uint8		uur_type;		/* record type code */
+	uint8		uur_info;		/* flag bits */
+	uint16		uur_prevlen;	/* length of previous record */
+	Oid			uur_reloid;		/* relation OID */
+	TransactionId uur_prevxid;	/* transaction id */
+	TransactionId uur_xid;		/* transaction id */
+	CommandId	uur_cid;		/* command id */
+	ForkNumber	uur_fork;		/* fork number */
+	UndoRecPtr	uur_blkprev;	/* byte offset of previous undo for block */
+	BlockNumber uur_block;		/* block number */
+	OffsetNumber uur_offset;	/* offset number */
+	Buffer		uur_buffer;		/* buffer in which undo record data points */
+	UndoRecPtr	uur_prevurp;	/* urec pointer to the previous record when
+								 * it is in a different log */
+	UndoRecPtr	uur_next;		/* urec pointer of the next transaction */
+	Oid			uur_dbid;		/* database id */
+
+	/* undo applying progress, see detail comment in UndoRecordTransaction */
+	uint32		uur_progress;
+	StringInfoData uur_payload; /* payload bytes */
+	StringInfoData uur_tuple;	/* tuple bytes */
+} UnpackedUndoRecord;
+
+
+extern void UndoRecordSetInfo(UnpackedUndoRecord *uur);
+extern Size UndoRecordExpectedSize(UnpackedUndoRecord *uur);
+extern bool InsertUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_written, bool header_only);
+extern bool UnpackUndoRecord(UnpackedUndoRecord *uur, Page page,
+				 int starting_byte, int *already_decoded, bool header_only);
+
+#endif							/* UNDORECORD_H */
diff --git a/src/include/access/xact.h b/src/include/access/xact.h
index 169cf28..ddaa633 100644
--- a/src/include/access/xact.h
+++ b/src/include/access/xact.h
@@ -14,6 +14,7 @@
 #ifndef XACT_H
 #define XACT_H
 
+#include "access/undolog.h"
 #include "access/xlogreader.h"
 #include "lib/stringinfo.h"
 #include "nodes/pg_list.h"
@@ -430,5 +431,6 @@ extern void ParseAbortRecord(uint8 info, xl_xact_abort *xlrec, xl_xact_parsed_ab
 extern void EnterParallelMode(void);
 extern void ExitParallelMode(void);
 extern bool IsInParallelMode(void);
+extern void SetCurrentUndoLocation(UndoRecPtr urec_ptr);
 
 #endif							/* XACT_H */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index f90a6a9..d18a1cd 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -310,6 +310,7 @@ extern XLogRecPtr GetInsertRecPtr(void);
 extern XLogRecPtr GetFlushRecPtr(void);
 extern XLogRecPtr GetLastImportantRecPtr(void);
 extern void GetNextXidAndEpoch(TransactionId *xid, uint32 *epoch);
+extern uint32 GetEpochForXid(TransactionId xid);
 extern void RemovePromoteSignalFiles(void);
 
 extern bool CheckPromoteSignal(void);
-- 
1.8.3.1
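
As a reading aid for undorecord.h above, here is a minimal sketch (not
part of the patch) of how a caller might drive the pack-and-insert API
it declares. It assumes that InsertUndoRecord returns true once the
whole record has been written, as its *already_written progress argument
suggests, and it stands in hypothetical helpers for the undo buffer
management that real callers would do elsewhere.

#include "postgres.h"
#include "access/undorecord.h"
#include "common/relpath.h"

/* Hypothetical helpers standing in for the real undo buffer management. */
extern Page get_next_undo_page(void);
extern int	get_insert_offset(Page page);

static void
illustrate_insert_undo_record(Oid reloid, TransactionId xid, CommandId cid,
							  BlockNumber blkno, OffsetNumber offnum)
{
	UnpackedUndoRecord uur = {0};
	int			already_written = 0;
	Size		size;

	uur.uur_type = 0;			/* caller-defined record type code */
	uur.uur_info = 0;			/* computed by UndoRecordSetInfo() below */
	uur.uur_reloid = reloid;
	uur.uur_xid = xid;
	uur.uur_cid = cid;
	uur.uur_fork = MAIN_FORKNUM;	/* omitted from the packed record */
	uur.uur_block = blkno;
	uur.uur_offset = offnum;
	/* no payload or tuple data in this example */

	UndoRecordSetInfo(&uur);	/* derive the uur_info flag bits */
	size = UndoRecordExpectedSize(&uur);	/* undo space a caller would allocate */
	(void) size;

	/* Copy the packed record into as many undo pages as it needs. */
	for (;;)
	{
		Page		page = get_next_undo_page();
		int			starting_byte = get_insert_offset(page);

		if (InsertUndoRecord(&uur, page, starting_byte,
							 &already_written, false))
			break;				/* the whole record has been written */
	}
}

Decoding presumably follows the same shape in reverse: call
UnpackUndoRecord per page with its *already_decoded counter until it
reports that the whole record has been reassembled.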

#35Andres Freund
andres@anarazel.de
In reply to: Dilip Kumar (#34)
Re: Undo logs

Hi,

This thread is currently marked as returned with feedback, set on
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

- Andres

#36Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#35)
Re: Undo logs

On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:

This thread is currently marked as returned with feedback, set on
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

From the latest status of this thread, there have been new patches but
no reviews on them, so moved to next CF.
--
Michael

#37Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Michael Paquier (#36)
Re: Undo logs

On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:

This thread is currently marked as returned with feedback, set on
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

From the latest status of this thread, there have been new patches but
no reviews on them, so moved to next CF.

Thank you. New patches coming soon.

--
Thomas Munro
http://www.enterprisedb.com

#38Dilip Kumar
dilipbalaut@gmail.com
In reply to: Andres Freund (#35)
Re: Undo logs

On Sun, Feb 3, 2019 at 3:53 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

This thread is currently marked as returned with feedback, set on
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

As part of this thread we have been reviewing and fixing the comments
for the undo-interface patch. Michael has already moved it to the new
commitfest with the status "Needs review", so I guess the status is
correct as of now.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Thomas Munro (#37)
Re: Undo logs

On 2019-Feb-04, Thomas Munro wrote:

On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:

This thread is currently marked as returned with feedback, set on
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

From the latest status of this thread, there have been new patches but
no reviews on them, so moved to next CF.

Thank you. New patches coming soon.

This series is for pg13, right? We're not considering any of this for
pg12?

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#40Thomas Munro
thomas.munro@enterprisedb.com
In reply to: Alvaro Herrera (#39)
Re: Undo logs

On Thu, Feb 7, 2019 at 1:16 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:

On 2019-Feb-04, Thomas Munro wrote:

On Mon, Feb 4, 2019 at 3:55 PM Michael Paquier <michael@paquier.xyz> wrote:

On Sun, Feb 03, 2019 at 02:23:16AM -0800, Andres Freund wrote:

This thread is currently marked as returned with feedback, set on
2018-12-01. Given there've been several new versions submitted since, is
that accurate?

From the latest status of this thread, there have been new patches but
no reviews on them, so moved to next CF.

Thank you. New patches coming soon.

This series is for pg13, right? We're not considering any of this for
pg12?

Correct. Originally the target was 12 but that was a bit too ambitious.

--
Thomas Munro
http://www.enterprisedb.com

#41Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#40)
Re: Undo logs

On Thu, Feb 07, 2019 at 03:21:09AM +1100, Thomas Munro wrote:

Correct. Originally the target was 12 but that was a bit too ambitious.

Would it be possible to move the patch set into the first PG-13 commit
fest then? We could use this CF as the recipient for now, even if the
schedule for the next development cycle is not set in stone:
https://commitfest.postgresql.org/23/
--
Michael

#42Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#41)
Re: Undo logs

On February 7, 2019 7:21:49 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Feb 07, 2019 at 03:21:09AM +1100, Thomas Munro wrote:

Correct. Originally the target was 12 but that was a bit too ambitious.

Would it be possible to move the patch set into the first PG-13 commit
fest then? We could use this CF as the recipient for now, even if the
schedule for the next development cycle is not set in stone:
https://commitfest.postgresql.org/23/

We now have the target version as a field; that should make such moves unnecessary, right?

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#43Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#42)
Re: Undo logs

On Thu, Feb 07, 2019 at 07:25:57AM +0530, Andres Freund wrote:

We now have the target version as a field; that should make such
moves unnecessary, right?

Oh, I missed this stuff. Thanks for pointing it out.
--
Michael

#44Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#43)
Re: Undo logs

On February 7, 2019 7:34:11 AM GMT+05:30, Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Feb 07, 2019 at 07:25:57AM +0530, Andres Freund wrote:

We now have the target version as a field, that should make such
moves unnecessary, right?

Oh, I missed this stuff. Thanks for pointing it out.

It was JUST added ... :) I thought I saw you reply on the other thread about it, but I was wrong...

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#45Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#44)
Re: Undo logs

On Thu, Feb 07, 2019 at 07:35:31AM +0530, Andres Freund wrote:

It was JUST added ... :) I thought I saw you reply on the other thread
about it, but I was wrong...

Six months later without any activity, I am marking this entry as
returned with feedback. The latest patch set does not apply anymore,
so a rebase would be nice if it is submitted again.
--
Michael

#46Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#45)
Re: Undo logs

On Sat, Nov 30, 2019 at 9:25 PM Michael Paquier <michael@paquier.xyz> wrote:

On Thu, Feb 07, 2019 at 07:35:31AM +0530, Andres Freund wrote:

It was JUST added ... :) I thought I saw you reply on the other thread
about it, but I was wrong...

Six months later without any activity, I am marking this entry as
returned with feedback. The latest patch set does not apply anymore,
so a rebase would be nice if it is submitted again.

Sounds fair, thanks. Actually, we've rewritten large amounts of this,
but unfortunately not to the point where it's ready to post yet. If
anyone wants to follow the development in progress, see
https://github.com/EnterpriseDB/zheap/commits/undo-record-set

This is not really an EnterpriseDB project any more because Andres and
Thomas decided to leave EnterpriseDB, but both expressed an intention
to continue working on the project. So hopefully we'll get there. That
being said, here's what the three of us are working towards:

- Undo locations are identified by a 64-bit UndoRecPtr, which is very
similar to a WAL LSN. However, each undo log (1TB of address space)
has its own insertion point, so that many backends can insert
simultaneously without contending on the insertion point. The code for
this is by Thomas and is mostly the same as before. (Illustrative
sketches of this and some of the later points appear after this list.)

- To insert undo data, you create an UndoRecordSet, which has a record
set header followed by any number of records. In the common case, an
UndoRecordSet corresponds to the intersection of a transaction and a
persistence level - that is, XID 12345 could have up to 3
UndoRecordSets, one for permanent undo, one for unlogged undo, and one
for temporary undo. We might in the future have support for other
kinds of UndoRecordSets, e.g. for multixact-like things that are
associated with a group of transactions rather than just one. This
code is new, by Thomas with contributions from me.

- The records that get stored into an UndoRecordSet will be serialized
from an in-memory representation and then deserialized when the data
is read later. Andres is writing the code for this, but hasn't pushed
it to the branch yet. The idea here is to allow a lot of flexibility
about what gets stored, responding to criticisms of the earlier design
from Heikki, while still being efficient about what we actually write
on disk, since we know from testing that undo volume is a significant
performance concern.

- Each transaction that writes permanent or unlogged undo gets an
UndoRequest, which tracks the fact that there is work to do if the
transaction aborts. Undo can be applied either in the foreground right
after abort or in the background. The latter case is necessary because
crashes or FATAL errors can abort transactions, but the former case is
important as a way of keeping the undo work from ballooning out of
control in a workload where people just abort transactions nonstop; we
have to slow things down so that we can keep up. This code is by me,
based on a design sketch from Andres.
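
To make some of the points above more concrete, here are a few small
sketches in C. None of this code is taken from the undo-record-set
branch; every type, field, constant and threshold below is an assumption
chosen only to match the prose in this mail.

First, the UndoRecPtr split. A 64-bit pointer with 1TB of address space
per undo log suggests a 24-bit log number and a 40-bit byte offset; the
real definitions live in the undo log patch and may differ in detail.

#include "postgres.h"

/* Illustrative only; the patch defines UndoRecPtr in access/undolog.h. */
typedef uint64 UndoRecPtr;

#define UNDO_LOG_OFFSET_BITS	40	/* 2^40 bytes = 1TB per undo log */
#define UNDO_LOG_OFFSET_MASK	((UINT64CONST(1) << UNDO_LOG_OFFSET_BITS) - 1)

static inline UndoRecPtr
MakeUndoRecPtr(uint32 logno, uint64 offset)
{
	return ((uint64) logno << UNDO_LOG_OFFSET_BITS) |
		(offset & UNDO_LOG_OFFSET_MASK);
}

static inline uint32
UndoRecPtrGetLogNo(UndoRecPtr urp)
{
	return (uint32) (urp >> UNDO_LOG_OFFSET_BITS);
}

static inline uint64
UndoRecPtrGetOffset(UndoRecPtr urp)
{
	return urp & UNDO_LOG_OFFSET_MASK;
}

Because the offset bits belong to a single log, each log can keep its
own insertion point, which is what lets many backends insert at the same
time without contending on one pointer.

Second, the "one UndoRecordSet per transaction and persistence level"
rule could be tracked with nothing more than an array indexed by
persistence level; again, the names are invented for illustration.

typedef enum UndoPersistenceLevel
{
	UNDO_PERMANENT = 0,
	UNDO_UNLOGGED,
	UNDO_TEMP,
	N_UNDO_PERSISTENCE_LEVELS
} UndoPersistenceLevel;

typedef struct UndoRecordSet UndoRecordSet;	/* opaque in this sketch */

/* At most one open set per persistence level in the current transaction. */
static UndoRecordSet *current_xact_urs[N_UNDO_PERSISTENCE_LEVELS];

Finally, one possible shape for the foreground-versus-background
decision attached to an UndoRequest. The fields and limits are made up;
the intent is only that a missing backend forces background apply, small
rollbacks are finished immediately, and a growing backlog pushes the
work back onto the aborting session so that abort-heavy workloads are
slowed down.

typedef struct UndoRequest
{
	TransactionId xid;			/* transaction whose undo must be applied */
	Oid			dbid;			/* database containing that undo */
	uint64		undo_size;		/* bytes of undo the transaction wrote */
} UndoRequest;

#define FOREGROUND_UNDO_LIMIT	(8 * 1024 * 1024)	/* assumed 8MB cutoff */
#define PENDING_UNDO_SOFT_LIMIT	(1024L * 1024 * 1024)	/* assumed 1GB backlog */

static bool
ShouldApplyUndoInForeground(const UndoRequest *req, bool backend_alive,
							uint64 pending_undo_bytes)
{
	/* Crashes and FATAL errors leave no backend around to do the work. */
	if (!backend_alive)
		return false;

	/* Small rollbacks are cheapest to finish right after the abort. */
	if (req->undo_size <= FOREGROUND_UNDO_LIMIT)
		return true;

	/*
	 * If the backlog of pending undo is growing, make the aborting backend
	 * do its own undo; this is the "slow things down so we can keep up"
	 * behaviour for workloads that abort transactions nonstop.
	 */
	if (pending_undo_bytes > PENDING_UNDO_SOFT_LIMIT)
		return true;

	return false;
}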

Getting all of this working has been harder and slower than I'd hoped,
but I think the new design fixes a lot of things that weren't right in
earlier iterations, so I feel like we are at least headed in the right
direction.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company